perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench #3041
Conversation
cc @kaiyux to make sure he is aware of the addition of AutoDeploy as another backend of trtllm-bench. Thanks.
@suyoggupta Is it possible to split the PR so that it only includes the changes that enable the integration of trtllm-bench with AutoDeploy? I see there are also a bunch of changes under the AutoDeploy core.
@kaiyux: The changes in AutoDeploy are needed to ensure the throughput benchmark in tensorrt-llm-bench can run without OOM. Those changes ensure that the AutoDeploy executor allocates a large enough KV cache, as required by the throughput test. So unfortunately, these changes have to be bundled together.
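For context, KV-cache sizing of this kind typically measures the free GPU memory left after the model weights are loaded and converts a fraction of it into a token capacity for the cache. The sketch below is illustrative only; `resize_kv_cache`, `free_mem_ratio`, and `cache_config.max_tokens` are hypothetical names, not the actual AutoDeploy API from this PR.

```python
# Illustrative sketch only: size the KV cache from free GPU memory so a
# max-throughput run does not OOM. All names here are hypothetical.
import torch


def resize_kv_cache(cache_config, bytes_per_token: int, free_mem_ratio: float = 0.9) -> int:
    """Return a KV-cache capacity (in tokens) based on currently free GPU memory.

    bytes_per_token: per-token KV footprint, i.e.
        2 (K and V) * num_layers * num_kv_heads * head_dim * dtype_size.
    free_mem_ratio: fraction of the remaining free memory handed to the cache.
    """
    free_bytes, _total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    cache_bytes = int(free_bytes * free_mem_ratio)
    num_tokens = cache_bytes // bytes_per_token
    cache_config.max_tokens = num_tokens  # hypothetical config field
    return num_tokens
```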
Looks great. Just a few minor improvements
If that's the case, maybe we should merge those changes into the core library first? There shouldn't be features "required by" the bench scripts; does that make sense? Though I agree it saves some overhead to avoid submitting another PR, I just wanted to clarify and align on the principle here. Thanks.
Approving the trtllm-bench changes. Thanks.
I agree with the general principle, and it's something we can follow for subsequent PRs.
lgtm
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench (NVIDIA#3041)

* Enable AutoDeploy as a backend in trtllm-bench
* update how caches are resized
* fix: files permission from 100755 to 100644
* some comments
* lint
* lint
* lint
* lint
* Fix function name
* refactor
* Remove spurious change
* Add cursor generated doc strings
* re-enable ad test
* some perf cleanup
* debug ci
* ensure that overlap scheduler is enabled
* Reorder the tests

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Next step is to close the perf gap between the AutoDeploy and PyTorch backends.
Current results:
Max throughput for Llama 3.1 8B, ISL/OSL = 128/128, FP16, on H100.
AutoDeploy using the FlashInfer attention backend:
PyTorch using the FlashInfer backend:
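For readers who want to reproduce the comparison above, the sketch below shows how the two backends might be driven through trtllm-bench. The dataset path and the exact `--backend` value used for AutoDeploy are assumptions for illustration, not confirmed by this PR.

```python
# Illustrative sketch only: invoke trtllm-bench's throughput subcommand for
# both backends. The dataset file and the "autodeploy" backend name are assumed.
import subprocess

MODEL = "meta-llama/Llama-3.1-8B"
DATASET = "synthetic_128_128.json"  # hypothetical ISL/OSL = 128/128 dataset

for backend in ("pytorch", "autodeploy"):
    subprocess.run(
        [
            "trtllm-bench", "--model", MODEL,
            "throughput",
            "--dataset", DATASET,
            "--backend", backend,
        ],
        check=True,
    )
```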