[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS
#12368
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Note: this PR is slightly different from the one I have already tested internally. I'd like to test it a bit before it's truly ready for review, but I didn't see an option to create this PR as a draft.
akeshet force-pushed the VLLM_LOGITS_PROCESSOR_THREADS branch from dc3aa96 to ac402d3
My local tests look good, ready for review!
This makes sense to me. Do you happen to have any performance data you can share? Perhaps benchmarking with structured output and xgrammar? It would be nice to show the benefit concretely. There are some scripts that would help automate this -- benchmarks/benchmark_guided.py and benchmarks/benchmark_serving_guided.py. Also, I wonder about just making this the default behavior instead of requiring it to be enabled via a tunable. Allowing the threadpool to be adjusted seems fine, but if the benefit is noticeable with one of our in-tree logits processors (xgrammar, in particular), then putting it on by default may make sense.
I can't share my raw numbers or traces, but I can say that we see a roughly 10% improvement in tok/s and req/s when using this feature in combination with our internal logits processor (with request concurrency of ~50 and a threadpool size of 50).
I wanted to be conservative and not alter existing behavior. As is often the case with Python multithreading, there are cases where I suspect this would hurt rather than help performance. For instance, to get the benefit from this PR with our internal logits processor logic I had to make some tweaks (a well-placed CUDA sync) to ensure that the logits processors in separate threads don't block each other via CUDA's memcpy lock.
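(For illustration only: the internal processor isn't part of this PR, so the snippet below is an assumption about what such a tweak could look like, with a hypothetical processor name.)

```python
import torch

def my_cpu_bound_processor(token_ids, logits):
    # Hypothetical custom logits processor that does CPU-side work on the logits.
    # Synchronizing before the device-to-host copy keeps the copy itself short,
    # so processor threads spend less time serialized inside the blocking copy
    # waiting for earlier kernels to finish.
    torch.cuda.synchronize()
    cpu_logits = logits.cpu()
    # ... CPU-bound work on cpu_logits (ideally releasing the GIL) ...
    return logits
```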
OK, thanks. If we're not able to show how it helps something in-tree, then I agree with having it off by default. Can you expand on the doc text to add some guidance on when it would be useful? Something like "useful when using custom logits processors that either (a) launch additional CUDA kernels or (b) do significant CPU-bound work while not holding the Python GIL, or both."
Done.
lgtm, thanks!
LGTM given this is limited to within logits_processor.py and is disabled by default, thanks.
@akeshet Is there a minimal example of how to use a logits processor with this environment variable?
@alejopaullier96 The minimal example would be simply to set VLLM_LOGITS_PROCESSOR_THREADS to the desired number of threads before starting vLLM, and then pass your logits processors as usual (e.g. via SamplingParams.logits_processors).
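A minimal sketch along those lines (the model name and the toy processor below are illustrative, not part of this PR):

```python
import os

# Size the logits-processor threadpool; set before the engine is created.
os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM, SamplingParams

def block_token_zero(token_ids, logits):
    # Toy custom logits processor: forbid token id 0 at every decoding step.
    logits[0] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=32, logits_processors=[block_token_zero])
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```

The threadpool only changes how the processor calls are scheduled across sequences in a batch; the processors themselves are written exactly as before.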
This PR adds an environment variable, VLLM_LOGITS_PROCESSOR_THREADS. If set, it causes vLLM to use a threadpool of the given size to multithread (across sequences in a batch) its calls to the provided logits processors.
This can increase GPU utilization and decrease ITL in cases where batches are large and where logits processors either (a) launch additional CUDA kernels or (b) do significant CPU-bound work while not holding the Python GIL, or both.
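For intuition, here is a hedged sketch of that behavior (the helper names are illustrative; this is not the actual code in logits_processor.py):

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def _apply(processors, token_ids, logits_row):
    # Run one sequence's logits processors, in order, on its row of logits.
    for proc in processors:
        logits_row = proc(token_ids, logits_row)
    return logits_row

def apply_logits_processors(seqs, logits, pool=None):
    # seqs: list of (token_ids, processors); logits: [num_seqs, vocab_size].
    if pool is None:
        # Default behavior: sequences are processed one after another.
        rows = [_apply(procs, ids, logits[i]) for i, (ids, procs) in enumerate(seqs)]
    else:
        # With VLLM_LOGITS_PROCESSOR_THREADS set: fan the calls out across threads.
        rows = list(pool.map(
            lambda item: _apply(item[1][1], item[1][0], logits[item[0]]),
            enumerate(seqs)))
    return torch.stack(rows)

# Pool sized from the env var, e.g. VLLM_LOGITS_PROCESSOR_THREADS=8.
pool = ThreadPoolExecutor(max_workers=8)
```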