
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform #12695


Open
wants to merge 13 commits into base: main

Conversation

HollowMan6
Contributor

@HollowMan6 HollowMan6 commented Feb 3, 2025

FIX #10714 for AMD GPUs

Related to #11743


github-actions bot commented Feb 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Feb 3, 2025
@HollowMan6
Contributor Author

Hi @youkaichao! Thank you for your amazing work in #11743. I've checked all the APIs and found that all the CuMem calls used here have equivalent HIP versions, so I tried to migrate the feature to ROCm as well. I almost succeeded (all the code compiles), but we hit a strange OOM issue with hipMemAddressReserve when trying to allocate within allocator.use_memory_pool(). So I just want to check with you, or anyone familiar with this, to see how we could resolve it.

For easier reproduction, I modified the demo and pushed it here: https://github.com/HollowMan6/vllm_allocator_adaptor
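
For context, the adaptor's create_and_map path boils down to the standard HIP virtual memory management sequence (reserve, create, map, set access), mirroring the cuMem* calls. The following is a minimal standalone sketch of that sequence, written by hand for illustration rather than taken from the adaptor, with most error handling omitted:

#include <hip/hip_runtime.h>
#include <cstdio>

// Reserve a virtual address range, back it with physical memory, map it,
// and make it accessible from the device. This mirrors the CUDA sequence
// cuMemAddressReserve / cuMemCreate / cuMemMap / cuMemSetAccess.
int main() {
  int device = 0;
  size_t size = 20 << 20;  // 20 MiB, as in the failing create_and_map call

  hipMemAllocationProp prop = {};
  prop.type = hipMemAllocationTypePinned;
  prop.location.type = hipMemLocationTypeDevice;
  prop.location.id = device;

  // Round the size up to the allocation granularity.
  size_t granularity = 0;
  hipMemGetAllocationGranularity(&granularity, &prop,
                                 hipMemAllocationGranularityMinimum);
  size = ((size + granularity - 1) / granularity) * granularity;

  // This is the call that reports "out of memory" on devices without
  // virtual memory management support.
  void* ptr = nullptr;
  hipError_t err = hipMemAddressReserve(&ptr, size, granularity, nullptr, 0);
  if (err != hipSuccess) {
    printf("hipMemAddressReserve failed: %s\n", hipGetErrorString(err));
    return 1;
  }

  hipMemGenericAllocationHandle_t handle;
  hipMemCreate(&handle, size, &prop, 0);
  hipMemMap(ptr, size, 0, handle, 0);

  hipMemAccessDesc access = {};
  access.location.type = hipMemLocationTypeDevice;
  access.location.id = device;
  access.flags = hipMemAccessFlagsProtReadWrite;
  hipMemSetAccess(ptr, size, &access, 1);

  // Tear down in reverse order.
  hipMemUnmap(ptr, size);
  hipMemRelease(handle);
  hipMemAddressFree(ptr, size);
  return 0;
}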

The error log when running test.py

amdgpu.ids: No such file or directory
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
[vllm_allocator_adaptor_c] create_and_map: device=0x3, size=20971520
[vllm_allocator_adaptor_c] CUDA Error: out of memory at vllm_allocator_adaptor_c.cpp:197
[vllm_allocator_adaptor_c] CUDA Error: out of memory at vllm_allocator_adaptor_c.cpp:137
[vllm_allocator_adaptor_c] CUDA Error: invalid argument at vllm_allocator_adaptor_c.cpp:138
[vllm_allocator_adaptor_c] CUDA Error: invalid argument at vllm_allocator_adaptor_c.cpp:145
[vllm_allocator_adaptor_c] create_and_map: device=0, size=20971520, d_mem=0, p_memHandle=0x80b7a00
Traceback (most recent call last):
  File "vllm_allocator_adaptor/test.py", line 53, in <module>
    y = torch.empty(shape, device='cuda')
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 63.73 GiB is free. Of the allocated memory 4.00 MiB is allocated by PyTorch, and 18.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@hongxiayang hongxiayang added the rocm Related to AMD ROCm label Feb 3, 2025
@HollowMan6 HollowMan6 force-pushed the sleep_amd branch 2 times, most recently from 15e1489 to 261ee50 Compare February 3, 2025 21:19
@mreso

mreso commented Feb 4, 2025

Hi @HollowMan6, just a quick feedback data point as I was testing this PR. For V0, inference seems to work, and the "from vllm.device_allocator.cumem import CuMemAllocator" import error I was facing on main is fixed, but for V1 I get this error:

INFO 02-03 16:07:28 model_runner.py:1118] Loading model weights took 14.9888 GB
ERROR 02-03 16:07:28 core.py:210] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 202, in run_engine_core
ERROR 02-03 16:07:28 core.py:210]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 02-03 16:07:28 core.py:210]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 156, in __init__
ERROR 02-03 16:07:28 core.py:210]     super().__init__(vllm_config, executor_class)
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 54, in __init__
ERROR 02-03 16:07:28 core.py:210]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 02-03 16:07:28 core.py:210]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 75, in _initialize_kv_caches
ERROR 02-03 16:07:28 core.py:210]     kv_cache_spec = self.model_executor.get_kv_cache_spec()
ERROR 02-03 16:07:28 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/executor/abstract.py", line 68, in get_kv_cache_spec
ERROR 02-03 16:07:28 core.py:210]     output = self.collective_rpc("get_kv_cache_spec")
ERROR 02-03 16:07:28 core.py:210]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 02-03 16:07:28 core.py:210]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-03 16:07:28 core.py:210]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/utils.py", line 2206, in run_method
ERROR 02-03 16:07:28 core.py:210]     raise NotImplementedError(f"Method {method!r} is not"
ERROR 02-03 16:07:28 core.py:210] NotImplementedError: Method 'get_kv_cache_spec' is not implemented.
ERROR 02-03 16:07:28 core.py:210]
CRITICAL 02-03 16:07:28 core_client.py:158] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

Not sure if you're aware of this, so I thought I'd quickly leave a comment. Let me know if you need more info to repro.

EDIT:
Oh and I had to remove "list(APPEND CUMEM_LIBS amdhip64)" in my env, otherwise I got a linker error: /usr/bin/ld: cannot find -lamdhip64.

@youkaichao
Member

but we had a strange OOM issue with hipMemAddressReserve when we tried to allocate within allocator.use_memory_pool()

This usually means the error happens inside hipMemAddressReserve; it is not really an OOM error.

When I implemented the PR, the most difficult part was making sure the calls to the CUDA driver API succeed. I don't know whether simply replacing some functions is enough to migrate these API calls. We need ROCm experts to confirm the API usage, as they are quite low-level (and error-prone).
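
One practical way to narrow this down is to log the driver's own error name at every call site instead of treating everything as OOM. A small hypothetical helper (not code from this PR):

#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical helper: print which call failed and the HIP error name, so
// that a hipErrorOutOfMemory returned by hipMemAddressReserve (e.g. because
// VMM is unsupported) can be told apart from a real allocation failure.
#define HIP_CHECK(call)                                                      \
  do {                                                                       \
    hipError_t err_ = (call);                                                \
    if (err_ != hipSuccess) {                                                \
      fprintf(stderr, "%s failed at %s:%d: %s (%s)\n", #call, __FILE__,      \
              __LINE__, hipGetErrorName(err_), hipGetErrorString(err_));     \
    }                                                                        \
  } while (0)

// Usage: HIP_CHECK(hipMemAddressReserve(&ptr, size, 0, nullptr, 0));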

@HollowMan6
Contributor Author

HollowMan6 commented Feb 4, 2025

Hi @HollowMan6, just a quick feedback data point as I was testing this PR. For V0 inference seems to work

EDIT: Oh and I had to remove "list(APPEND CUMEM_LIBS amdhip64)" in my env, otherwise I got a linker error: /usr/bin/ld: cannot find -lamdhip64.

Hi @mreso! Thank you for your feedback! Are you sure sleep/wake is working without the strange OOM error on AMD with this PR? It looks like you failed to link against the ROCm library, so sleep/wake shouldn't work here. Removing list(APPEND CUMEM_LIBS amdhip64) is not the correct solution, as it stops the module from linking against the library correctly. The issue you have looks more like an environment issue, so maybe you can try something like export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH for now. There may be a better way to handle this properly in the CMakeLists.txt, but let's first get hipMemAddressReserve working.

The "from vllm.device_allocator.cumem import CuMemAllocator" import error I was facing on main is fixed, but for V1 I get this error:

Oh really? It sounds like the logic that checks whether you are using NVIDIA or AMD is failing, since in the original codebase CuMemAllocator is disabled for AMD GPUs. I can't reproduce this on my side, though, so maybe you want to file a separate issue about it. Regarding the V1 error, neither this PR nor #11743 modified that code path, so it should be a separate issue, too.

@HollowMan6
Contributor Author

Okay, it turns out to be a hardware limitation on my side: I'm using an MI250x, and it looks like it doesn't support virtual memory management, as confirmed by https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/memory/hip_vmm_common.hh#L27-L37

Maybe this can work on newer AMD hardware, such as MI300 and above? Unfortunately, I don't have access to such hardware, so I can't take this PR any further myself. I will mark this PR as ready, since directly mapping those CUDA calls seems to be OK; we also have some test cases here for reference: https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc#L213-L256
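
The check in the linked hip_vmm_common.hh essentially queries one device attribute; a minimal standalone sketch of the same query (illustration only, not code from this PR):

#include <hip/hip_runtime.h>
#include <cstdio>

// Query whether the device supports virtual memory management
// (hipMemAddressReserve / hipMemCreate / hipMemMap). MI250x is expected to
// report 0 here, which matches the OOM seen from hipMemAddressReserve.
int main() {
  int device = 0;
  int vmm_supported = 0;
  hipDeviceGetAttribute(&vmm_supported,
                        hipDeviceAttributeVirtualMemoryManagementSupported,
                        device);
  printf("VMM supported on device %d: %d\n", device, vmm_supported);
  return 0;
}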

Feel free to take this PR over, @youkaichao or anyone who is interested and has those GPUs. I have set this PR to allow edits and access to secrets by maintainers.

@HollowMan6 HollowMan6 marked this pull request as ready for review February 4, 2025 13:26
@DarkLight1337
Member

@hongxiayang are you able to help with this?

@hongxiayang
Collaborator

@hongxiayang are you able to help with this?

Thanks for your contribution. @HollowMan6

@DarkLight1337 I will check this when I have more bandwidth.

@mreso

mreso commented Feb 5, 2025

Thanks @HollowMan6, you're right, that must be an env issue. I was not explicitly testing sleep/wake, so it would most likely crash if I did.

Oh really? It sounds like the logic that checks whether you are using NVIDIA or AMD is failing, since in the original codebase CuMemAllocator is disabled for AMD GPUs.

Interesting that this does not work in my case. This line is causing an exception on my end, and I assume that if CUDA is not present it is supposed to swallow the ModuleNotFoundError and disable the CuMemAllocator that way, right? But in my case an AssertionError is thrown here, which is not caught. If I raise a ModuleNotFoundError instead, it works.

Is there another mechanism I am missing that should disable CuMemAllocator?

@HollowMan6
Contributor Author

HollowMan6 commented Feb 5, 2025

@mreso

I was not explicitly testing sleep/wake, so it would most likely crash if I did.

What hardware are you using? If it's MI300 or above, then it likely supports virtual memory management, and sleep/wake mode should not crash with that OOM error. You are welcome to test this out and let us know what you find!

Interesting that this does not work in my case. This line is causing an exception on my end, and I assume that if CUDA is not present it is supposed to swallow the ModuleNotFoundError and disable the CuMemAllocator that way, right? But in my case an AssertionError is thrown here, which is not caught. If I raise a ModuleNotFoundError instead, it works.

Is there another mechanism I am missing that should disable CuMemAllocator?

On main, vllm.cumem_allocator shouldn't exist in the first place for AMD, which is why only ModuleNotFoundError is caught. That module comes from the C side, so you would need to find out why you have vllm/cumem_allocator.abi3.so (it gets compiled by your CMakeLists.txt), even though that CMakeLists.txt specifies it should only be compiled for CUDA; most likely an env issue again. Did you somehow install CUDA on your AMD machine? I would also suggest cleaning up your workdir to ensure there is no leftover built cumem_allocator.abi3.so, especially when switching from this PR back to main.

@mreso

mreso commented Feb 5, 2025

On main, vllm.cumem_allocator shouldn't exist in the first place for AMD, which is why only ModuleNotFoundError is caught. That module comes from the C side, so you would need to find out why you have vllm/cumem_allocator.abi3.so (it gets compiled by your CMakeLists.txt), even though that CMakeLists.txt specifies it should only be compiled for CUDA; most likely an env issue again. Did you somehow install CUDA on your AMD machine? I would also suggest cleaning up your workdir to ensure there is no leftover built cumem_allocator.abi3.so, especially when switching from this PR back to main.

That's it, thanks a lot! It was an unclean build env. There is no trace of CUDA on the machine, but the file must have been created anyway, so the ModuleNotFoundError was not raised... After removing all files and rebuilding, it works.

I am on MI300s and will give it a try.

@mreso

mreso commented Feb 5, 2025

Gave it a quick try using this test script:

import torch
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)

def run_inference(prompt):
    outputs = llm.generate(prompt)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


print("CUDA Memory Usage (after inference):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

run_inference("San Francisco is")
llm.sleep()

print("CUDA Memory Usage (after sleep):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

llm.wake_up()

print("CUDA Memory Usage (after wakeup):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

run_inference("Paris is")

but it seems like sleep did not free any memory, and after wakeup I got an OOM error:

INFO 02-05 11:19:42 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 54.63 seconds
CUDA Memory Usage (after inference):
torch.cuda.memory_allocated()=168654098432
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.85it/s, est. speed input: 15.41 toks/s, output: 61.64 toks/s]
Prompt: 'San Francisco is', Generated text: ' a top tourist destination, with millions of visitors each year. Whether you’re traveling'
INFO 02-05 11:19:46 worker.py:133] Sleep mode freed 0.00 GiB memory, 160.18 GiB memory is still in use.
CUDA Memory Usage (after sleep):
torch.cuda.memory_allocated()=168654097920
CUDA Error: out of memory at /home/mreso/vllm/csrc/cumem_allocator.cpp:56
Segmentation fault (core dumped)

This is my hw:

=========================================== Concise Hardware Info ============================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS             BUS           PARTITION ID
0    2     0x74a1  51140  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:08:00.0  0
1    5     0x74a1  36564  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:28:00.0  0
2    4     0x74a1  18917  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:48:00.0  0
3    3     0x74a1  28918  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:68:00.0  0
4    8     0x74a1  15111  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:88:00.0  0
5    9     0x74a1  57880  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:A8:00.0  0
6    7     0x74a1  44328  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:C8:00.0  0
7    6     0x74a1  13826  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:E9:00.0  0
==============================================================================================================

Let me know if the test script is incorrect for this or I can try anything else.

@HollowMan6
Contributor Author

Thank you @mreso! It looks like MI300 supports virtual memory management, but it gives an OOM error at a different line and then ends with a segmentation fault; maybe the HIP APIs need to be called differently from the CUDA ones. Unfortunately, I can't take this PR any further as I don't have an MI300 at hand, but thank you anyway, @mreso!

@YangWang92

YangWang92 commented Feb 13, 2025

(Quoting @mreso's test script, OOM log, and hardware info from the comment above.)

I can also confirm that the current solution does not work on MI300 (on v0.7.2).

INFO 02-13 15:46:13 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 02-13 15:46:18 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.77it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:03,  1.66s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:05<00:02,  2.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.08s/it]

INFO 02-13 15:46:27 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 02-13 15:47:10 worker.py:284] Memory profiling takes 43.04 seconds
INFO 02-13 15:47:10 worker.py:284] the current vLLM instance can use total_gpu_memory (191.45GiB) x gpu_memory_utilization (0.90) = 172.31GiB
INFO 02-13 15:47:10 worker.py:284] model weights take 14.99GiB; non_torch_memory takes 2.21GiB; PyTorch activation peak memory takes 13.50GiB; the rest of the memory reserved for KV Cache is 141.61GiB.
INFO 02-13 15:47:10 executor_base.py:110] # CUDA blocks: 72501, # CPU blocks: 2048
INFO 02-13 15:47:10 executor_base.py:115] Maximum concurrency for 131072 tokens per request: 8.85x
INFO 02-13 15:48:44 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00,  1.73it/s]
INFO 02-13 15:49:04 model_runner.py:1562] Graph capturing finished in 20 secs, took 0.20 GiB
INFO 02-13 15:49:04 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 157.36 seconds
CUDA Memory Usage (after inference):
torch.cuda.memory_allocated()=168163364864
Processed prompts: 100%|████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.84s/it, est. speed input: 1.41 toks/s, output: 5.64 toks/s]
Prompt: 'San Francisco is', Generated text: ' a top tourist destination, with millions of visitors each year. Whether you’re traveling'
CUDA Memory Usage (after sleep):
torch.cuda.memory_allocated()=168163364352
CUDA Memory Usage (after wakeup):
torch.cuda.memory_allocated()=168163364352
Processed prompts: 100%|████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.47s/it, est. speed input: 1.22 toks/s, output: 6.49 toks/s]
Prompt: 'Paris is', Generated text: ' a battleground for the culinary world. Street vendors, high-end restaurants, bakers'
[rank0]:[W213 15:49:11.474759635 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of an

@YangWang92

By the way, feel free to tell me how I can help with this PR.


mergify bot commented Mar 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @HollowMan6.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@tjtanaa
Contributor

tjtanaa commented Mar 13, 2025

(Quoting the earlier exchange between @mreso and @HollowMan6 about the -lamdhip64 linker error and the LD_LIBRARY_PATH workaround.)

@HollowMan6 @hongxiayang I can't seem to compile due to the missing linker library -lamdhip64 when building with DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm-ellm:sleepamd .

 deprecated: This API is marked as deprecated and might not be supported in future releases. For more details please re[194/1943]
//github.com/ROCm/HIP/blob/develop/docs/reference/deprecated_api_list.md [-Wdeprecated-declarations]                             
57.87    57 |   return hipDevicePrimaryCtxRetain(ctx, dev);                                                                      
57.87       |          ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~                                                                       
57.87 In file included from /app/vllm/csrc/cumem_allocator_compat.h:7,                                                           
57.87                  from /app/vllm/csrc/cumem_allocator.cpp:6:                                                                
57.87 /opt/rocm/include/hip/hip_runtime_api.h:5437:12: note: declared here                                                       
57.87  5437 | hipError_t hipDevicePrimaryCtxRetain(hipCtx_t* pctx, hipDevice_t dev);                                             
57.87       |            ^~~~~~~~~~~~~~~~~~~~~~~~~                                                                               
58.04 [2/28] Linking CXX shared module cumem_allocator.abi3.so                                                                   
58.04 FAILED: cumem_allocator.abi3.so                                                                                            
58.04 : && /usr/bin/c++ -fPIC -Wno-unused-result -O2 -g -DNDEBUG   -shared  -o cumem_allocator.abi3.so CMakeFiles/cumem_allocator
.dir/csrc/cumem_allocator.cpp.o  -Wl,-rpath,/usr/local/lib/python3.12/dist-packages/torch/lib:/opt/rocm-6.3.1/lib:/opt/rocm/lib: 
 -lamdhip64  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so  /usr/local/lib/python3.12/dist-packages/torch/lib/lib
c10.so  -Wl,--no-as-needed,"/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed  -Wl,--no-as-neede
d,"/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so" -Wl,--as-needed  /usr/local/lib/python3.12/dist-packages/to
rch/lib/libc10_hip.so  /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so  /opt/rocm-6.3.1/lib/libMIOpen.so.1.0.60301  /
opt/rocm/lib/libhiprtc.so.6.3.60301  -ldl  /opt/rocm-6.3.1/lib/libhipblas.so.2.3.60301  /opt/rocm-6.3.1/lib/libhipfft.so.0.1.6030
1  /opt/rocm-6.3.1/lib/libhiprand.so.1.1.60301  /opt/rocm-6.3.1/lib/librocrand.so.1.1.60301  /opt/rocm-6.3.1/lib/libhipsparse.so.
1.1.0.60301  /opt/rocm-6.3.1/lib/libhipsolver.so.0.3.60301  /opt/rocm-6.3.1/lib/libhipblaslt.so.0.13  /opt/rocm/lib/libamdhip64.s
o.6.3.60301  -Wl,--no-as-needed,"/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so" -Wl,--as-needed  -Wl,-rpath-link,
/opt/rocm-6.3.1/lib && :
58.04 /usr/bin/ld: cannot find -lamdhip64: No such file or directory
58.04 collect2: error: ld returned 1 exit status

This compilation error is resolved through:

sudo ln -sf /opt/rocm/lib/libamdhip64.so.6.3.60301 /usr/lib/libamdhip64.so
sudo ldconfig

HIP-TESTs of VirtualMemoryManagement

I am using the HIP and ROCm in the vLLM docker image to build the tests.

I am getting two failures when running the VirtualMemoryManagement tests:

# ./VirtualMemoryManagementTest 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VirtualMemoryManagementTest is a Catch v2.13.4 host application.
Run with -? for options

-------------------------------------------------------------------------------
Unit_hipMemCreate_MapNonContiguousChunks
-------------------------------------------------------------------------------
/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemCreate.cc:272
...............................................................................

/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemCreate.cc:323: FAILED:
  REQUIRE( true == std::equal(B_h.begin(), B_h.end(), C_h.data()) )
with expansion:
  true == false

Memory access fault by GPU node-9 (Agent handle: 0xd13970) on address 0x7f3214c47000. Reason: Unknown.
GPU core dump created: gpucore.28281
-------------------------------------------------------------------------------
Unit_hipMemSetAccess_Vmm2UnifiedMemCpy
-------------------------------------------------------------------------------
/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc:552
...............................................................................

/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc:552: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:    24 |    22 passed | 2 failed
assertions: 73852 | 73850 passed | 2 failed

Aborted

HIP-TEST setup

export ROCM_BRANCH=rocm-6.3.x
git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git
export HIPTESTS_DIR="$(readlink -f hip-tests)"
cd "$HIPTESTS_DIR"
mkdir -p build; cd build
cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=/opt/rocm
make build_tests
cd catch_tests/unit/virtualMemoryManagement/
./VirtualMemoryManagementTest

kliuae added 3 commits May 14, 2025 19:19
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
kliuae and others added 7 commits May 14, 2025 19:26
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: kf <kuanfu.liu@embeddedllm.com>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
…= "fork") -> Callable:

Added in 2bb0e1a ?

Feel free to revert

Signed-off-by: Hollow Man <hollowman@opensuse.org>
hipMemAddressReserve requires the alignment parameter, which is set to 0 in cuMemAddressReserve.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Reduce memory chunk size to reduce ROCm hang and save more memory.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@YangWang92

YangWang92 commented May 16, 2025

Hi @kliuae, may I ask why your code sets CHUNK_SIZE to 512 MB, and does this setting have some benefits?

@kliuae
Contributor

kliuae commented May 22, 2025

Hi @kliuae, may I ask why your code sets CHUNK_SIZE to 512 MB, and does this setting have some benefits?

Hi @YangWang92, our thinking behind this was that we wanted to avoid fragmenting the allocation as much as possible, and in our brief testing, increasing the chunk size did reduce the first model loading and sleep time. The next power of two beyond 512 MB yielded the same OOM symptom as the unchunked version.
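
To illustrate the idea (a hand-written sketch, not the PR's actual implementation; kChunkSize and map_in_chunks are made-up names): instead of backing the whole reservation with a single physical allocation, the reserved range is backed and mapped in fixed-size chunks, roughly like this:

#include <hip/hip_runtime.h>
#include <vector>

// Back an already-reserved virtual range with physical memory in 512 MiB
// chunks instead of one large allocation. Chunk sizes should be multiples of
// the granularity returned by hipMemGetAllocationGranularity.
constexpr size_t kChunkSize = 512ULL << 20;  // 512 MiB

void map_in_chunks(void* base, size_t total, int device,
                   std::vector<hipMemGenericAllocationHandle_t>& handles) {
  hipMemAllocationProp prop = {};
  prop.type = hipMemAllocationTypePinned;
  prop.location.type = hipMemLocationTypeDevice;
  prop.location.id = device;

  hipMemAccessDesc access = {};
  access.location.type = hipMemLocationTypeDevice;
  access.location.id = device;
  access.flags = hipMemAccessFlagsProtReadWrite;

  for (size_t offset = 0; offset < total; offset += kChunkSize) {
    size_t chunk = (total - offset < kChunkSize) ? (total - offset) : kChunkSize;
    hipMemGenericAllocationHandle_t handle;
    hipMemCreate(&handle, chunk, &prop, 0);
    hipMemMap(static_cast<char*>(base) + offset, chunk, 0, handle, 0);
    hipMemSetAccess(static_cast<char*>(base) + offset, chunk, &access, 1);
    handles.push_back(handle);
  }
}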

@YangWang92

YangWang92 commented May 23, 2025

Hi @kliuae, may I ask why your code sets CHUNK_SIZE to 512 MB, and does this setting have some benefits?

Hi @YangWang92, our thinking behind this was that we wanted to avoid fragmenting the allocation as much as possible, and in our brief testing, increasing the chunk size did reduce the first model loading and sleep time. The next power of two beyond 512 MB yielded the same OOM symptom as the unchunked version.

Thanks for your reply. I found a new issue about slow initialization of vLLM on ROCm, which is related to the ROCm ROCr runtime and CLR. I think ROCm will fix these issues later.

HollowMan6 added a commit to HollowMan6/verl that referenced this pull request May 23, 2025
It looks like the `actor_rollout_ref.rollout.free_cache_engine` control
option only works for vLLM versions 0.5.4 and 0.6.3; the sleep / wake
up mode in the vLLM engine, as well as release / resume memory occupation
in SGLang, is enabled by default and there's no way to turn them off.

While always allowing the inference engine to free its cache would be ideal,
it's unfortunately not supported on some devices, such as the AMD MI250x,
since they don't support virtual memory management:
vllm-project/vllm#12695 (comment)

So we need to be able to turn it off so that verl can run on
those devices.

In addition, it looks like we no longer need to enforce eager mode in the
latest vLLM when we choose to free the cache, so this PR also lifts that
restriction.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
HollowMan6 added further commits to HollowMan6/verl that referenced this pull request on May 26, May 28, Jun 1, Jun 2, Jun 4, Jun 5, Jun 6, Jun 7, and Jun 10, 2025, all with the same commit message as above.
Labels
ci/build rocm Related to AMD ROCm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: API for evicting all KV cache from GPU memory (or sleep mode)
10 participants