
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench #3041


Merged
suyoggupta merged 21 commits into NVIDIA:main from user/sg/autodeploy-fix on Mar 26, 2025

Conversation

suyoggupta
Collaborator

@suyoggupta suyoggupta commented Mar 24, 2025

1. This PR enables the integration of trtllm-bench with AutoDeploy.
2. Adds a feature to the AutoDeploy inference optimizer that inflates the KV caches to fill the available GPU memory, which helps improve token throughput (see the sketch below).

The next step is to close the perf gap between the AutoDeploy and PyTorch backends.
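For context, the cache-inflation feature in (2) amounts to querying free device memory once the model and its workspaces are allocated, then growing the paged KV cache to consume most of what remains. Below is a minimal sketch of that general technique; it is an illustration only, not the code added by this PR, and every name in it (`inflated_num_cache_pages`, `bytes_per_page`, `free_mem_fraction`) is hypothetical.

```python
import torch


def inflated_num_cache_pages(current_pages: int, bytes_per_page: int,
                             free_mem_fraction: float = 0.9) -> int:
    """Grow a paged KV cache to fill most of the remaining GPU memory.

    bytes_per_page: size of one cache page summed over all layers and heads.
    free_mem_fraction: headroom factor for activations and fragmentation.
    """
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
    # current device. Call it after model weights and workspaces are in
    # place so free_bytes reflects what is actually left for the cache.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    extra_pages = int(free_bytes * free_mem_fraction) // bytes_per_page
    return current_pages + max(extra_pages, 0)
```

In trtllm-bench itself the backend is selected via the `--backend` option of the `throughput` subcommand; the exact value this PR registers for AutoDeploy is not reproduced here.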
Current results:
Max throughput for Llama 3.1 8B, ISL/OSL = 128/128, FP16, on an H100.
AutoDeploy using the FlashInfer attention backend:

Request Throughput (req/sec):                     98.6547                                                                                                    
Total Output Throughput (tokens/sec):             12627.7960                                                                                                 
Per User Output Throughput (tokens/sec/user):     14.1241                                                                                                    
Per GPU Output Throughput (tokens/sec/gpu):       12627.7960                                                                                                 
Total Latency (ms):                               10136.3690                                                                                                 
Average request latency (ms):                     9081.9387  

PyTorch using the FlashInfer backend:

Request Throughput (req/sec):                     115.5171                                                                                                   
Total Output Throughput (tokens/sec):             14786.1895                                                                                                 
Per User Output Throughput (tokens/sec/user):     16.4321                                                                                                    
Per GPU Output Throughput (tokens/sec/gpu):       14786.1895                                                                                                 
Total Latency (ms):                               8656.7266
Average request latency (ms):                     7796.3203        

@juney-nvidia juney-nvidia changed the title [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench Mar 24, 2025
@juney-nvidia juney-nvidia requested a review from kaiyux March 24, 2025 22:48
@juney-nvidia
Collaborator

Looping in @kaiyux to make sure he is aware of the addition of AutoDeploy as another backend of trtllm-bench.

Thanks
June

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta suyoggupta force-pushed the user/sg/autodeploy-fix branch from 8fb54ab to d58d8d6 on March 25, 2025 01:56
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #355 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #355 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #324 completed with status: 'FAILURE'

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #376 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #376 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #339 completed with status: 'FAILURE'

@kaiyux
Member

kaiyux commented Mar 25, 2025

The trtllm-bench part looks good to me.

@suyoggupta Is it possible to split the PR so that it only includes the changes for "enabling the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@suyoggupta
Collaborator Author

The trtllm-bench part looks good to me.

@suyoggupta Is it possible to split the PR so that it only includes the changes for "enabling the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@kaiyux : The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. Those changes ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.
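To make the sizing requirement concrete: a max-throughput run keeps many requests in flight at once, and each in-flight request needs KV-cache capacity for roughly its full prompt plus its generated output. A hypothetical back-of-the-envelope check (illustrative only, not code from this PR; the ISL/OSL values match the results above and the concurrency is an assumed figure):

```python
def kv_cache_tokens_needed(concurrent_requests: int, isl: int, osl: int) -> int:
    """Tokens of KV-cache capacity needed so every in-flight request can
    hold its full prompt (ISL) plus its generated output (OSL)."""
    return concurrent_requests * (isl + osl)


# E.g. 256 concurrent requests at ISL/OSL = 128/128 need room for
# 256 * 256 = 65,536 cached tokens. A cache sized for a small default
# would cap concurrency or fail, hence growing it to fit the available
# GPU memory before the benchmark starts.
print(kv_cache_tokens_needed(256, 128, 128))  # 65536
```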

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #455 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #455 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #390 completed with status: 'FAILURE'

Member

@lucaslie lucaslie left a comment


Looks great. Just a few minor improvements

suyoggupta and others added 2 commits March 25, 2025 12:58
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 25, 2025

PR_Github #475 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #475 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #408 completed with status: 'FAILURE'

suyoggupta and others added 2 commits March 25, 2025 19:14
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #496 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #496 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #428 completed with status: 'FAILURE'

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #514 [ run ] triggered by Bot

@kaiyux
Member

kaiyux commented Mar 26, 2025

The trtllm-bench part looks good to me.
@suyoggupta Is it possible to split the PR so that it only includes the changes for "enabling the integration of trtllm-bench with AutoDeploy"? I see there are also a bunch of changes under the AutoDeploy core (tensorrt_llm/_torch/auto_deploy).

@kaiyux : The changes in AutoDeploy are needed to ensure the throughput benchmark in trtllm-bench can run without OOM. Those changes ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.

If that's the case, maybe we should merge those changes into the core library first? There shouldn't be features "required by" the bench scripts; does that make sense?

Though I agree it saves some overhead to avoid submitting another PR, I just wanted to clarify and align on the principle here. Thanks.

Member

@kaiyux kaiyux left a comment


Approving the trtllm-bench changes. Thanks.

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #514 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #439 completed with status: 'SUCCESS'

@kaiyux
Member

kaiyux commented Mar 26, 2025

/bot run

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #548 [ run ] triggered by Bot

@niukuo
Collaborator

niukuo commented Mar 26, 2025

PR_Github #548 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #466 completed with status: 'SUCCESS'

suyoggupta and others added 2 commits March 26, 2025 08:51
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #612 [ run ] triggered by Bot

@suyoggupta
Collaborator Author

If that's the case, maybe we should merge those changes into the core library first? There shouldn't be features "required by" the bench scripts; does that make sense?

Though I agree it saves some overhead to avoid submitting another PR, I just wanted to clarify and align on the principle here. Thanks.

I agree with the general principle; it's something we can follow for subsequent PRs.

Member

@lucaslie lucaslie left a comment


lgtm

@tensorrt-cicd
Collaborator

PR_Github #612 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #516 completed with status: 'SUCCESS'

@suyoggupta suyoggupta merged commit 047f2b2 into NVIDIA:main Mar 26, 2025
2 checks passed
wu1du2 pushed a commit to wu1du2/TensorRT-LLM that referenced this pull request May 11, 2025
…IDIA#3041)

* Enable AutoDeploy as a backend in trtllm-bench

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* update how caches are resized

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* fix: files permission from 100755 to 100644

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* some comments

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Fix function name

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* refactor

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Remove spurious change

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Add cursor generated doc strings

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* re-enable ad test

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* some perf cleanup

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* debug ci

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* ensure that overlap scheduler is enabled

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* Reorder the tests

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

---------

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>