-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform #12695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
Hi @youkaichao! Thank you for your amazing work in #11743. I've checked all the APIs and found that all the CuMem calls used here actually have equivalent Hip versions, so I tried to migrate the feature into ROCm as well. I almost succeeded with all the code compiles, but we had a strange OOM issue with For easier reproduction, I modified the demo and pushed it here: https://github.com/HollowMan6/vllm_allocator_adaptor The error log when running
|
15e1489
to
261ee50
Compare
Hi @HollowMan6, just a quick feedback data point as I was testing this PR. For V0 inference seems to work and the " from vllm.device_allocator.cumem import CuMemAllocator" import error I was facing on main is fixed but for V1 I get this error:
Not sure if you're aware of this, so I thought I quickly leave comment. Let me know if you need more info to repro. EDIT: |
this usually means error happens for when I implement the PR, the most difficult part is to make sure the call to cuda driver API succeeds. I don't know if simply replacing some functions works to migrate these API calls. We need rocm experts to confirm the API usage, as they are quite low-level (and error-prone). |
Hi @mreso! Thank you for your feedback! Are you sure it's working for the sleep/wake without the strange OOM error on AMD with this PR? It looks like you failed to link to the ROCm library, so the sleep/wake shouldn't work here. Remove
Oh really? it sounds like they have some failed logic in checking whether you are using NVIDIA or AMD, as in the original codebase, CuMemAllocator is disabled for AMD GPUs. I can't produce this on my side, though, so maybe you want to file a separate issue about this. Regarding the V1 one, neither this PR nor #11743 made any modification to this, so it should be a separate issue, too. |
Okay, it turns out to be a hardware issue on my side, I'm using MI250x, and it looks like it doesn't support virtual memory management, as is confirmed by https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/memory/hip_vmm_common.hh#L27-L37 Maybe this can work on some advanced AMD hardware, such as MI300+? Unfortunately, I don't have access to hardware such as MI300+, so I can't help with this PR any further. I will set this PR as ready as directly mapping those CUDA calls seems to be OK, we also have some test cases here for reference: https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc#L213-L256 Feel free to take this PR over @youkaichao or anyone who is interested and has those GPUs. I have set this PR to allow edits and access to secrets by maintainers |
@hongxiayang are you able to help with this? |
Thanks for your contribution. @HollowMan6 @DarkLight1337 I will check this when I have more bandwidth. |
Thanks @HollowMan6 you're right, that will be an env issue and I was not specifically testing sleep/wake explicitly so it will most likely crash if I did.
Interesting that this does not work in my case. This line is causing an exception on my end and I assume if cuda is not present its supposed to swallow the ModuleNotFoundError and disable the CuMemAllocator that way, right? But in my case an AssertionError is thrown here which is not caught. If I raise a ModuleNotFoundError error instead its working. Is there another mechanism I am missing that should disable CuMemAllocator? |
What hardware are you using? If it's MI300 and above, then it's likely that it supports virtual memory management and the sleep/wake mode will not crash in that OOM way, you are welcome to test this out and let us know what you found out!
In main, |
Thats it, thanks a lot! Was an unclean build env. There is no trace of CUDA on the machines but it must have created the file anyways and so the ModuleNotFoundError was not raised... After removing all files and rebuilding it works. I am on MI300s and will give it a try. |
Gave it a quick try using this test script:
but seems like sleep did not free any memory and after wakeup I got an OOM error:
This is my hw:
Let me know if the test script is incorrect for this or I can try anything else. |
Thank you @mreso! It looks like MI300 supports virtual memory management, but it gives an OOM error at a different line and then ends up with a segment fault, maybe Hip APIs should be called in a different way than CUDA. Unfortunately, I can't help with this PR any further as I don't have MI300 in my hand, but thank you anyway, @mreso! |
I can also confirm that the current solution does not work on mi300 (on v0.7.2).
|
By the way, feel free to tell me what I can help with on this pull. |
f0cc987
to
d1948f1
Compare
This pull request has merge conflicts that must be resolved before it can be |
@HollowMan6 @hongxiayang I can't seem to compile due to missing linker
This compilation error is resolved through:
HIP-TESTs of VirtualMemoryManagementI am using the hip and rocm in the vLLM docker image to build the test I am getting two failure when running VirtualMemoryManagement tests
HIP-TEST setup
|
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: kf <kuanfu.liu@embeddedllm.com> Signed-off-by: Hollow Man <hollowman@opensuse.org>
…= "fork") -> Callable: Added in 2bb0e1a ? Feel free to revert Signed-off-by: Hollow Man <hollowman@opensuse.org>
hipMemAddressReserve requires the aligment parameter, which is set to 0 in cuMemAddressReserve. Signed-off-by: Hollow Man <hollowman@opensuse.org>
reduce memory chunk size to reduce rocm hang and save more memory. Signed-off-by: Hollow Man <hollowman@opensuse.org>
hi @kliuae, may I ask why your code set CHUNK_SIZE to 512 MB and does this setting has some benefits? |
Hi @YangWang92, our thinking behind this was that we would like to avoid fragmenting the allocation as much as possible, and in our brief testing on this, increasing the chunk size did reduce the first model loading and sleep time. The next power of two beyond 512MB yielded the same OOM symptom as the unchunked version. |
Thanks for your reply, I found a new issue about the slow initialization of vllm on rocm, which is related to rocm rocr runtime and clr. I think rocm will fix these issues later. |
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
It looks like `actor_rollout_ref.rollout.free_cache_engine` control option only works for vLLM version 0.5.4 and 0.6.3, the sleep / wake up mode in vLLM engine, as well as release / resume memory occupation in SGLang is enabled by default and there's no way to turn them off. While always alllowing inference engine to free cache can be ideal, it's unfortunately not supported on some devices, such as AMD Mi250x, since it doesn't support virtual memory management: vllm-project/vllm#12695 (comment) So we would need to be able to turn it off so that verl can run on those devices. In addition, it looks like we no longer need to enforce eager in latest vLLM when we choose to free cache, so this PR also lifted this restriction. Signed-off-by: Hollow Man <hollowman@opensuse.org>
FIX #10714 for AMD GPUs
Related to #11743