
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench #3041


Merged
suyoggupta merged 21 commits into NVIDIA:main from user/sg/autodeploy-fix on Mar 26, 2025

Conversation

suyoggupta
Collaborator

@suyoggupta suyoggupta commented Mar 24, 2025

1. This PR enables the integration of trtllm-bench with AutoDeploy.
2. Adds a feature to the AutoDeploy inference optimizer that inflates the KV caches to fill the available GPU memory, which helps improve token throughput (see the sketch below).

The next step is to close the perf gap between the AutoDeploy and PyTorch backends.
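For context, the cache-inflation feature in (2) amounts to querying free device memory once the model and its workspaces are allocated, then growing the paged KV cache to consume most of what remains. Below is a minimal sketch of that general technique; it is an illustration only, not the code added by this PR, and every name in it (`inflated_num_cache_pages`, `bytes_per_page`, `free_mem_fraction`) is hypothetical.

```python
import torch


def inflated_num_cache_pages(current_pages: int, bytes_per_page: int,
                             free_mem_fraction: float = 0.9) -> int:
    """Grow a paged KV cache to fill most of the remaining GPU memory.

    bytes_per_page: size of one cache page summed over all layers and heads.
    free_mem_fraction: headroom factor for activations and fragmentation.
    """
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
    # current device. Call it after model weights and workspaces are in
    # place so free_bytes reflects what is actually left for the cache.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    extra_pages = int(free_bytes * free_mem_fraction) // bytes_per_page
    return current_pages + max(extra_pages, 0)
```

In trtllm-bench itself the backend is selected via the `--backend` option of the `throughput` subcommand; the exact value this PR registers for AutoDeploy is not reproduced here.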
Current results:
Max throughput for Llama 3.1 8B, ISL/OSL = 128/128, FP16, on an H100.
AutoDeploy using the FlashInfer attention backend:

Request Throughput (req/sec):                     98.6547                                                                                                    
Total Output Throughput (tokens/sec):             12627.7960                                                                                                 
Per User Output Throughput (tokens/sec/user):     14.1241                                                                                                    
Per GPU Output Throughput (tokens/sec/gpu):       12627.7960                                                                                                 
Total Latency (ms):                               10136.3690                                                                                                 
Average request latency (ms):                     9081.9387  

PyTorch using the FlashInfer backend:

Request Throughput (req/sec):                     115.5171                                                                                                   
Total Output Throughput (tokens/sec):             14786.1895                                                                                                 
Per User Output Throughput (tokens/sec/user):     16.4321                                                                                                    
Per GPU Output Throughput (tokens/sec/gpu):       14786.1895                                                                                                 
Total Latency (ms):                               8656.7266
Average request latency (ms):                     7796.3203        

@juney-nvidia juney-nvidia changed the title [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench Mar 24, 2025
@juney-nvidia juney-nvidia requested a review from kaiyux March 24, 2025 22:48
@juney-nvidia
Collaborator

Looping in @kaiyux to make sure he is aware of the addition of AutoDeploy as another backend of trtllm-bench.

Thanks
June

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta suyoggupta force-pushed the user/sg/autodeploy-fix branch from 8fb54ab to d58d8d6 on March 25, 2025 01:56
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #355 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #355 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #324 completed with status: 'FAILURE'

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #376 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #376 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #339 completed with status: 'FAILURE'

@kaiyux
Member

kaiyux commented Mar 25, 2025

The trtllm-bench part looks good to me.

@suyoggupta Is it possible to split the PR so that it only includes the changes for "enabling the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@suyoggupta
Collaborator Author

The trtllm-bench part looks good to me.

@suyoggupta Is it possible to split the PR so that it only includes the changes for "enabling the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@kaiyux : The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. Those changes ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.
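To make the sizing requirement concrete: a max-throughput run keeps many requests in flight at once, and each in-flight request needs KV-cache capacity for roughly its full prompt plus its generated output. A hypothetical back-of-the-envelope check (illustrative only, not code from this PR; the ISL/OSL values match the results above and the concurrency is an assumed figure):

```python
def kv_cache_tokens_needed(concurrent_requests: int, isl: int, osl: int) -> int:
    """Tokens of KV-cache capacity needed so every in-flight request can
    hold its full prompt (ISL) plus its generated output (OSL)."""
    return concurrent_requests * (isl + osl)


# E.g. 256 concurrent requests at ISL/OSL = 128/128 need room for
# 256 * 256 = 65,536 cached tokens. A cache sized for a small default
# would cap concurrency or fail, hence growing it to fit the available
# GPU memory before the benchmark starts.
print(kv_cache_tokens_needed(256, 128, 128))  # 65536
```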

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #455 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #455 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #390 completed with status: 'FAILURE'

Member

@lucaslie lucaslie left a comment


Looks great. Just a few minor improvements

suyoggupta and others added 2 commits March 25, 2025 12:58
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #475 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #475 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #408 completed with status: 'FAILURE'

suyoggupta and others added 2 commits March 25, 2025 19:14
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #496 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #496 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #428 completed with status: 'FAILURE'

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #514 [ run ] triggered by Bot

@kaiyux
Member

kaiyux commented Mar 26, 2025

The trtllm-bench part looks good to me.
@suyoggupta Is it possible to split the PR so that it only includes the changes for "enabling the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@kaiyux : The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. Those changes ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.

If that's the case, maybe we should merge those changes into the core library first? There shouldn't be features "required by" the bench scripts; does that make sense?

Though I agree it saves some overhead to avoid submitting another PR, I just wanted to clarify and align on the principle here. Thanks.

Member

@kaiyux kaiyux left a comment


Approving the trtllm-bench changes. Thanks.

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #514 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #439 completed with status: 'SUCCESS'

@kaiyux
Member

kaiyux commented Mar 26, 2025

/bot run

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #548 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #548 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #466 completed with status: 'SUCCESS'

suyoggupta and others added 2 commits March 26, 2025 08:51
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #612 [ run ] triggered by Bot

@suyoggupta
Collaborator Author

If that's the case, maybe we should merge those changes into the core library first? There shouldn't be features "required by" the bench scripts; does that make sense?

Though I agree it saves some overhead to avoid submitting another PR, I just wanted to clarify and align on the principle here. Thanks.

I agree with the general principle; it's something we can follow for subsequent PRs.

Member

@lucaslie lucaslie left a comment


lgtm

@tensorrt-cicd
Collaborator

PR_Github #612 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #516 completed with status: 'SUCCESS'

@suyoggupta suyoggupta merged commit 047f2b2 into NVIDIA:main Mar 26, 2025
2 checks passed
wu1du2 pushed a commit to wu1du2/TensorRT-LLM that referenced this pull request May 11, 2025
…IDIA#3041)

* Enable AutoDeploy as a backend in trtllm-bench

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* update how caches are resized

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* fix: files permission from 100755 to 100644

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* some comments

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Fix function name

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* refactor

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Remove spurious change

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Add cursor generated doc strings

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* re-enable ad test

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* some perf cleanup

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* debug ci

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* ensure that overlap scheduler is enabled

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Reorder the tests

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

---------

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>