
[Ray Core] After the ray job is finished, it will stably trigger resource leakage #49999

Closed
Moonquakes opened this issue Jan 22, 2025 · 22 comments · Fixed by #50812
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order)

Comments


Moonquakes commented Jan 22, 2025

What happened + What you expected to happen

While running a workload, I found that a certain type of job reliably triggers a resource leak after the Ray job ends. I reduced the code to a minimal reproduction and provide it below. The Ray cluster used for testing has a single 128-CPU / 896 GB worker node, minWorkerNum is 0, maxWorkerNum is 1, and no environment variables are configured. After the job has run for one minute, manually run ray job stop to trigger the problem.

The job exhibits the following abnormal behavior, which I think is worth investigating in depth to see what bug is being triggered:

  1. While the job is running, the Resource Status in the Overview always shows the CPU resources fully occupied (i.e., 128 tasks running in parallel), but only a dozen or so tasks show as running on the Job Detail page.
  2. After the job finishes, a logical resource leak is almost always triggered: the Resource Status in the Overview keeps showing the resources as occupied, so the worker node cannot be scaled down.
  3. After the job finishes, a Ray task leak is almost always triggered: many pending tasks remain in the Demands section of the Resource Status in the Overview, which causes the node to scale up and down continuously.

Image

These three problems are occasionally triggered individually by other jobs, but with the code above they are almost always triggered together. Please take a look at where the problem lies. Thank you!

Versions / Dependencies

Ray v2.40.0
Kuberay v1.2.2

Reproduction script

import ray
import time
import random

@ray.remote
def son():
    time.sleep(10)

@ray.remote
def father():
    # Submit 9000 child tasks, each with a unique random memory requirement.
    futures = []
    for _ in range(9000):
        futures.append(son.options(memory=random.randint(1, 1024**3)).remote())

    # Collect the results one at a time as they become ready.
    res_list = []
    while len(futures) > 0:
        ready_futures, futures = ray.wait(futures, num_returns=1)
        res_list.extend(ray.get(ready_futures))

if __name__ == '__main__':
    ray.init()
    ray.get(father.remote())

Submit the job: ray job submit --address=http://localhost:8265 --working-dir=. -- python3 test_resource_leak.py
Then, after the job has run for one minute, execute: ray job stop 02000000 --address=http://localhost:8265

Issue Severity

High: It blocks me from completing my task.

Moonquakes added the bug and triage (Needs triage) labels on Jan 22, 2025
jcotant1 added the core (Issues that should be addressed in Ray Core) label on Jan 22, 2025
jjyao added the P0 (Issues that should be fixed in short order) label and removed the triage label on Jan 22, 2025
jjyao (Collaborator) commented Jan 23, 2025

I'm able to repro. Thanks for reporting.

Moonquakes (Author) commented:

@jjyao Thanks for your quick confirmation and looking forward to having this fixed!

Moonquakes (Author) commented:

Hi @jjyao, this problem seems quite serious. Is there any progress or an update?

jjyao (Collaborator) commented Feb 5, 2025

@Moonquakes we figured out the root cause and will fix it soon.

Moonquakes (Author) commented:

Hi @jjyao, I have also done some research on this issue, but have not made much progress. I am curious about the root cause of these phenomena. If it is convenient, can you explain it in more detail? Thank you!

BTW, I noticed this PR (#50280), but there is no relevant description. Is it related to this issue?

Moonquakes (Author) commented:

Hi @jjyao, sorry to bother you, but this problem is really affecting our daily usage. I would be very grateful for an explanation of the root cause or a quick fix.

jjyao (Collaborator) commented Feb 17, 2025

Hi @Moonquakes, sorry for the late reply. The issue happens because the raylet fails to detect the death of the driver immediately (it does eventually detect it, but only after hours). That's because each task has a unique resource (memory) requirement, which makes the raylet very slow during scheduling. In your real use case, do you really need a unique resource requirement for each task?
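
As context (an illustration only, not an authoritative description of Ray internals): tasks whose resource requirements differ fall into different scheduling classes, a point edoakes makes explicit further down, and the repro draws a fresh random memory value per task, so nearly every task ends up in its own class:

import random

# Each distinct `memory` value passed to son.options() is a distinct resource
# requirement, hence (per the explanation above) a distinct scheduling class.
# The repro draws a random byte count per task, so ~9000 tasks produce roughly
# 9000 distinct requirements:
memories = [random.randint(1, 1024**3) for _ in range(9000)]
print(len(set(memories)))  # almost always 9000, i.e. about one class per task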

Moonquakes (Author) commented Feb 17, 2025

@jjyao Thank you for your reply! Yes, in our scenario different tasks need different memory requirements to avoid triggering OOM, because the files they process vary greatly in size. At the same time, we want to use the cluster resources as fully as possible, so we specify fine-grained memory requirements.

Are all three problems above caused by this? In my testing it also seems related to ray.wait(num_returns=1).

In addition, is there any quick fix for the situation you mentioned (for example, patching the raylet so that it first checks whether the driver has exited; I am not sure where that logic lives in the code)? We could apply and test such a patch quickly.

jjyao (Collaborator) commented Feb 17, 2025

@Moonquakes

res_list = []
while len(futures) > 0:
    ready_futures, futures = ray.wait(futures, num_returns=1)
    res_list.extend(ray.get(ready_futures))

Are you able to call ray.get() less frequently: only when all objects are ready?
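
A minimal sketch of this suggestion, reusing son from the repro script and omitting the per-task memory options for brevity (the batch size of 100 in the second variant is arbitrary, purely for illustration):

import ray
import time

@ray.remote
def son():
    time.sleep(10)

@ray.remote
def father():
    futures = [son.remote() for _ in range(9000)]
    # Simplest form of the suggestion: block once for everything and call
    # ray.get a single time instead of ray.wait(num_returns=1) in a loop.
    return ray.get(futures)

@ray.remote
def father_batched():
    futures = [son.remote() for _ in range(9000)]
    # Middle ground: still stream results, but in batches so ray.wait and
    # ray.get are called far less often.
    results = []
    while futures:
        ready, futures = ray.wait(futures, num_returns=min(100, len(futures)))
        results.extend(ray.get(ready))
    return results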

Moonquakes (Author) commented Feb 17, 2025

@jjyao I can try modifying this part, but in the real code there is downstream logic that consumes each result as soon as it is ready, so switching to a one-time ray.get would have some impact on the overall job time. I would therefore prefer a patch that keeps the existing logic mostly unchanged.

jjyao (Collaborator) commented Feb 17, 2025

Yea that's just a mitigation. I'll get the actual fix out asap.

Moonquakes (Author) commented:

@jjyao Thank you very much for your work! It would be a great help to us if it can be fixed!

Moonquakes (Author) commented Feb 19, 2025

Hey @jjyao @edoakes, I found that if DEBUG output is turned on (RAY_BACKEND_LOG_LEVEL='debug'), resources are released quickly, which seems strange.

edoakes (Collaborator) commented Feb 19, 2025

@Moonquakes thanks for the info, let me test that and see the behavior difference.

Btw, one suggestion to mitigate the issue and also to improve the performance of this generally would be to use "buckets" for the memory requirements. Instead of having a unique requirement for every file, have a predefined set (1MiB, 5MiB, 10MiB, etc.) and then round up for each task.
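
A rough sketch of this bucketing idea (the bucket boundaries below are made up for illustration; real values would come from the actual file-size distribution):

# Round each task's memory request up to a small, predefined set of values so
# the raylet only sees a handful of distinct requirements instead of ~9000.
MiB = 1024 ** 2
GiB = 1024 ** 3
MEMORY_BUCKETS = [128 * MiB, 512 * MiB, 1 * GiB, 2 * GiB, 4 * GiB]

def bucketed_memory(estimated_bytes: int) -> int:
    # Return the smallest bucket that fits the estimate (clamped to the largest).
    for bucket in MEMORY_BUCKETS:
        if estimated_bytes <= bucket:
            return bucket
    return MEMORY_BUCKETS[-1]

# In the repro's father() task this would replace the random memory value, e.g.:
# futures.append(son.options(memory=bucketed_memory(estimate)).remote())

Here `estimate` stands for whatever per-file memory estimate the real job already computes.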

Moonquakes (Author) commented Feb 20, 2025

@edoakes Thanks for your feedback!

I also tried your PR, and resources now seem to be reclaimed correctly, but I don't understand why switching to calling DisconnectClient solves this problem. Can you explain the reason in more detail? Thank you very much!

edoakes (Collaborator) commented Feb 20, 2025

@Moonquakes I will update the PR description to describe the underlying issue. Hoping to get this merged in the next couple of days :)

Moonquakes (Author) commented:

@edoakes Thanks a lot! It would also be great to know exactly why the bug arose before this PR (e.g. where the logic slowed down enough to block worker reclamation, since it does seem to recover on its own after a while), and why resources are released within about ten seconds once the PR is applied. Is it taking a "fast path" through some code that wasn't being blocked?

edoakes (Collaborator) commented Feb 21, 2025

Basically, the issue is that there were many pending messages in the unix domain socket the worker uses to communicate with the raylet. We don't get an error from the socket until those messages are drained. Typically they are drained quickly, but here, because you're using so many unique scheduling classes, processing each message takes a long time and chews through CPU.

The minor code change I made explicitly disconnects the client in the raylet and disregards any further messages. This is the path we should have been going through the whole time. We then stop processing the expensive messages immediately when terminating the worker.

Moonquakes (Author) commented:

Hi @edoakes, when running the above script (for about three minutes), the warning message below appears. It seems to cause the job stop to fail. Is it related to the socket messages not being received?

[Screenshot of the warning message]

edoakes (Collaborator) commented Feb 26, 2025


Was this running on my PR branch or on master? I think it's likely because the raylet is so overloaded.

Have you implemented my suggestion of bucketing the memory resource requirements?

edoakes added a commit that referenced this issue Mar 10, 2025
…51033)

Currently, the worker fires-and-forgets a disconnect message to the
raylet, then goes through its shutdown procedure and closes the socket.

However, in #50812 we want to add detection of unexpected disconnects on the socket. With the current
procedure the worker may send the disconnect and close its socket
immediately, causing the raylet to detect the unexpected disconnect
prior to processing the graceful disconnect message.

This PR adds a `DisconnectClientReply` that the raylet will send once it
has processed the disconnect message. The worker blocks until it
receives this reply, then proceeds with the disconnect procedure and
closes its socket.

## Related issue number

Towards #49999

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Moonquakes (Author) commented Mar 12, 2025

Hi @edoakes, there seem to be a lot of scattered PRs related to this issue. Can you list which PRs need to be applied to fix it?

Also, can you describe what logic was changed to fix it? Thank you!

edoakes (Collaborator) commented Mar 12, 2025

@Moonquakes this PR is the one that fixes the issue: #50812

It depends on these:

You can test out the fix using the nightly wheels

qinyiyan pushed a commit to qinyiyan/ray that referenced this issue Mar 13, 2025
…ay-project#50812)

Adds logic to periodically check for unexpected disconnects from worker
processes and proactively disconnect them when it happens. Currently, we
only mark a worker as disconnected once we've processed all messages
from the socket and get the EOF.

In some cases, we may process these messages very slowly or never (see
linked issue).

The disconnect detection is implemented by periodically `poll`ing for a
hangup (`POLLHUP`) on the file descriptor of each registered worker's socket every
1s (configurable).

This is left unimplemented on Windows.

## Related issue number

Closes ray-project#49999

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
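
For illustration only, here is a toy Python version of that hangup-detection mechanism (the real fix lives in the raylet's C++ code; this sketch just shows what polling a unix-socket file descriptor for a peer hangup looks like, assuming Linux poll semantics):

import select
import socket

# Emulate the raylet and a worker sharing a unix domain socket.
raylet_end, worker_end = socket.socketpair()

poller = select.poll()
# Register with an empty event mask: POLLHUP/POLLERR are always reported.
poller.register(raylet_end.fileno(), 0)

print(poller.poll(100))  # [] -- the worker is still connected

worker_end.close()       # simulate the worker process dying unexpectedly

for fd, events in poller.poll(100):
    if events & (select.POLLHUP | select.POLLERR):
        print("peer hung up; proactively disconnect this worker")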