Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Use the correct way to check whether an actor task is running or not #51158

Merged
merged 3 commits into from
Mar 8, 2025

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Mar 7, 2025

Why are these changes needed?

Ray core worker processes change the process title by calling setproctitle before and after executing tasks. Then, verify_tasks_running_or_terminated uses the following logic to check whether the Ray actor task is running or not:

if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():

However, the logic is not correct in some cases. To understand the issue, we need to understand the behavior of setproctitle and psutil.Process(pid).name().

  • setproctitle:

    • It calls functions like prctl based on the OS and arch under the hood. It updates /proc/pid/task/tid/comm of the calling thread.
    • If the thread is the leader thread (i.e. main thread in Python), /proc/pid/comm will also be updated.
    • /proc/pid/cmdline and /proc/pid/task/tid/cmdline will always be updated based on my observation.
    • Note that comm has a length limitation (15 chars)
  • psutil.Process(pid).name()

If the actor task is not running on the main thread (in which case setproctitle will not update /proc/pid/comm), and if users pass an options.name to the actor that doesn't start with /proc/pid/comm, then task_name in psutil.Process(pid).name() will evaluate to false, which is incorrect.

def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 force-pushed the 20250306-devbox1-tmux-3-ray7 branch from 3d77e1b to 9ef035b Compare March 7, 2025 19:33
@kevin85421 kevin85421 marked this pull request as ready for review March 7, 2025 19:59
@kevin85421
Copy link
Member Author

This is to fix #51058 (comment).

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Mar 7, 2025
@kevin85421 kevin85421 requested a review from edoakes March 7, 2025 22:57
@edoakes edoakes merged commit 4568beb into master Mar 8, 2025
6 checks passed
@edoakes edoakes deleted the 20250306-devbox1-tmux-3-ray7 branch March 8, 2025 00:37
elimelt pushed a commit to elimelt/ray that referenced this pull request Mar 9, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
edoakes pushed a commit that referenced this pull request Mar 11, 2025
#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes #51182
Closes #45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
qinyiyan pushed a commit to qinyiyan/ray that referenced this pull request Mar 13, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.

```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.

```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Jay Chia <17691182+jaychia@users.noreply.github.com>
Drice1999 pushed a commit to Drice1999/ray that referenced this pull request Mar 23, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Drice1999 pushed a commit to Drice1999/ray that referenced this pull request Mar 23, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.

```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants