
[Feature] Support KV cache offloading and disagg prefill with LMCache connector. #12953


Merged
merged 5 commits into vllm-project:main from localdev/lmcache-connector on Feb 25, 2025

Conversation

@YaoJiayi (Contributor) commented Feb 8, 2025

LMCache (https://github.com/LMCache/LMCache/tree/dev) uses the kv_transfer interface to support both KV cache offloading and disagg prefill.
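For context, the connector is enabled through vLLM's kv_transfer configuration. Below is a minimal offline-inference sketch; the "LMCacheConnector" name, the "kv_both" role, and the model name are assumptions for illustration, not settings confirmed by this PR.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Assumed settings: "LMCacheConnector" as the connector name, and
# kv_role="kv_both" so this instance both stores (offloads) and retrieves KV.
ktc = KVTransferConfig(kv_connector="LMCacheConnector", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=ktc,
)

# Repeated prefixes can now hit KV offloaded to CPU/disk/remote storage
# instead of being recomputed.
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```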

The original kv_connector interfaces recv_kv_caches_and_hidden_states and send_kv_caches_and_hidden_states are used as wrappers that call lmcache_retrieve_kv (which retrieves KV from local CPU, local disk, or remote storage into vLLM's paged memory) and lmcache_store_kv (which extracts KV from vLLM's paged memory to local CPU, local disk, or remote storage), respectively; see the sketch below.
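A rough sketch of this wrapper pattern (the lmcache import path, signatures, and return values are assumptions for illustration, not the exact PR code):

```python
from typing import List, Union

import torch

# Import path assumed; the PR description only names the two functions.
from lmcache.integration.vllm.vllm_adapter import (lmcache_retrieve_kv,
                                                   lmcache_store_kv)


class LMCacheConnector:

    def recv_kv_caches_and_hidden_states(
        self,
        model_executable: torch.nn.Module,
        model_input: "ModelInputForGPUWithSamplingMetadata",
        kv_caches: List[torch.Tensor],
    ):
        # Pull KV from local CPU, local disk, or remote storage into vLLM's
        # paged memory; report whether the forward pass can be skipped.
        bypass_model_exec, model_input = lmcache_retrieve_kv(
            model_executable, model_input, kv_caches)
        # Hidden states are not cached (see the review discussion below),
        # so None is returned and the caller recomputes what it needs.
        return None, bypass_model_exec, model_input

    def send_kv_caches_and_hidden_states(
        self,
        model_executable: torch.nn.Module,
        model_input: "ModelInputForGPUWithSamplingMetadata",
        kv_caches: List[torch.Tensor],
        hidden_or_intermediate_states: Union[torch.Tensor, None],
    ) -> None:
        # Extract KV from vLLM's paged memory to local CPU, local disk, or
        # remote storage; hidden states are intentionally not stored.
        lmcache_store_kv(model_executable, model_input, kv_caches)
```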

github-actions bot commented Feb 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@YaoJiayi YaoJiayi changed the title [Enhancement] Support KV cache offloading and disagg prefill with LMCache connector. [Feature] Support KV cache offloading and disagg prefill with LMCache connector. Feb 8, 2025
@YaoJiayi YaoJiayi marked this pull request as ready for review February 8, 2025 09:04
@Leaf996 commented Feb 19, 2025

Do we still need lmcache_vllm, or just lmcache?

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> (5 commits)
@YaoJiayi YaoJiayi force-pushed the localdev/lmcache-connector branch from 1cff5c6 to 12b713e Compare February 19, 2025 17:36
@YaoJiayi (Contributor, Author) commented

> Do we still need lmcache_vllm, or just lmcache?

The lmcache-vllm repo is not needed if this PR gets merged.

@YaoJiayi YaoJiayi requested a review from KuntaiDu February 19, 2025 17:40
model_executable: torch.nn.Module,
model_input: "ModelInputForGPUWithSamplingMetadata",
kv_caches: List[torch.Tensor],
hidden_or_intermediate_states: Union[torch.Tensor,

why not send hidden_or_intermediate_states to remote cache

Collaborator

Currently LMCache assumes that the user only stores KV caches. I guess the API can be extended but it requires some API change. @YaoJiayi does this align with what you are thinking?

Contributor (Author)

This is correct :) The last token will be re-prefilled in disagg prefill for now.
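To spell out the trade-off: because hidden states are not cached, even a full KV hit leaves the sampler without the final hidden state, so the forward pass is re-run over the last token to regenerate it. A caller-side sketch of that control flow (assumed logic, not the exact PR code; run_model is a hypothetical helper):

```python
# Try to load cached KV into paged memory; hidden_states is None because
# the connector stores KV only.
hidden_states, bypass_model_exec, model_input = (
    connector.recv_kv_caches_and_hidden_states(
        model_executable, model_input, kv_caches))

if not bypass_model_exec or hidden_states is None:
    # Re-prefill: KV for the cached prefix is already in paged memory, so
    # the forward pass only covers the remaining token(s), regenerating the
    # hidden state needed for sampling.
    hidden_states = run_model(model_executable, model_input)
```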

@KuntaiDu (Collaborator) left a comment

LGTM.


@KuntaiDu KuntaiDu self-requested a review February 22, 2025 13:43
@KuntaiDu KuntaiDu enabled auto-merge (squash) February 24, 2025 16:11
@KuntaiDu KuntaiDu added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 24, 2025
@simon-mo simon-mo merged commit 2f42a48 into vllm-project:main Feb 25, 2025
59 of 63 checks passed
Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 6, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
… connector. (vllm-project#12953)

Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Labels: documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed)

6 participants