[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS
#12368
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Note: this PR is slightly different from the one I have already tested internally. I'd like to test it a bit before it's truly ready for review, but I didn't see an option to create this PR as a draft.
akeshet force-pushed the VLLM_LOGITS_PROCESSOR_THREADS branch from dc3aa96 to ac402d3
My local tests look good, ready for review!
This makes sense to me. Do you happen to have any performance data you can share? Perhaps benchmarking with structured output and xgrammar? It would be nice to show the benefit concretely. There are some scripts that would help automate this -- benchmarks/benchmark_guided.py and benchmarks/benchmark_serving_guided.py. Also, I wonder about just making this the default behavior instead of requiring it to be enabled via a tunable. Allowing the threadpool to be adjusted seems fine, but if the benefit is noticeable with one of our in-tree logits processors (xgrammar, in particular), then putting it on by default may make sense.
I can't share my raw numbers or traces, but I can say that we see a roughly 10% improvement in tok/s and req/s when using this feature in combination with our internal logits processor (with request concurrency of ~50 and a threadpool size of 50).
I wanted to be conservative and not alter existing behavior. As is often the case with Python multithreading, there are cases where I suspect this would hurt rather than help performance. For instance, to get the benefit from this PR with our internal logits processor logic I had to make some tweaks (a well-placed CUDA sync) to ensure that the logits processors in separate threads don't block each other via CUDA's memcpy lock.
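(For illustration only: the internal processor isn't part of this PR, so the snippet below is an assumption about what such a tweak could look like, with a hypothetical processor name.)

```python
import torch

def my_cpu_bound_processor(token_ids, logits):
    # Hypothetical custom logits processor that does CPU-side work on the logits.
    # Synchronizing before the device-to-host copy keeps the copy itself short,
    # so processor threads spend less time serialized inside the blocking copy
    # waiting for earlier kernels to finish.
    torch.cuda.synchronize()
    cpu_logits = logits.cpu()
    # ... CPU-bound work on cpu_logits (ideally releasing the GIL) ...
    return logits
```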
OK, thanks. If we're not able to show how it helps something in-tree, then I agree with having it off by default. Can you expand on the doc text to add some guidance on when it would be useful? Something like "useful when using custom logits processors that either (a) launch additional CUDA kernels or (b) do significant CPU-bound work while not holding the Python GIL, or both."
Done.
lgtm, thanks!
LGTM given this is limited to within logits_processor.py and is disabled by default, thanks.
@akeshet Is there a minimal example of how to use a logits processor with this environment variable?
@alejopaullier96 The minimal example would be simply to set VLLM_LOGITS_PROCESSOR_THREADS to the desired number of threads before starting vLLM, and then pass your logits processors as usual (e.g. via SamplingParams.logits_processors).
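A minimal sketch along those lines (the model name and the toy processor below are illustrative, not part of this PR):

```python
import os

# Size the logits-processor threadpool; set before the engine is created.
os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM, SamplingParams

def block_token_zero(token_ids, logits):
    # Toy custom logits processor: forbid token id 0 at every decoding step.
    logits[0] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=32, logits_processors=[block_token_zero])
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```

The threadpool only changes how the processor calls are scheduled across sequences in a batch; the processors themselves are written exactly as before.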
This PR adds an environment variable, VLLM_LOGITS_PROCESSOR_THREADS. If set, it causes vLLM to use a threadpool of the given size to multithread (across sequences in a batch) its calls to the provided logits processors.
This can increase GPU utilization and decrease ITL in cases where batches are large and where logits processors either (a) launch additional CUDA kernels or (b) do significant CPU-bound work while not holding the Python GIL, or both.
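For intuition, here is a hedged sketch of that behavior (the helper names are illustrative; this is not the actual code in logits_processor.py):

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def _apply(processors, token_ids, logits_row):
    # Run one sequence's logits processors, in order, on its row of logits.
    for proc in processors:
        logits_row = proc(token_ids, logits_row)
    return logits_row

def apply_logits_processors(seqs, logits, pool=None):
    # seqs: list of (token_ids, processors); logits: [num_seqs, vocab_size].
    if pool is None:
        # Default behavior: sequences are processed one after another.
        rows = [_apply(procs, ids, logits[i]) for i, (ids, procs) in enumerate(seqs)]
    else:
        # With VLLM_LOGITS_PROCESSOR_THREADS set: fan the calls out across threads.
        rows = list(pool.map(
            lambda item: _apply(item[1][1], item[1][0], logits[item[0]]),
            enumerate(seqs)))
    return torch.stack(rows)

# Pool sized from the env var, e.g. VLLM_LOGITS_PROCESSOR_THREADS=8.
pool = ThreadPoolExecutor(max_workers=8)
```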