[pull] master from ray-project:master #140

Open · wants to merge 6,529 commits into base: master
Conversation

@pull pull bot commented Jun 29, 2023

See Commits and Changes for more details.

Created by pull[bot]

edoakes and others added 28 commits March 7, 2025 16:16
More work towards breaking the core -> libraries dependencies.

- `test_output.py`: changed to use the Ray core API instead.
- `test_client_library_integration.py`: moved to a test inside the rllib directory.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Train v2 has a new way of doing Train + Tune interop. This PR adds the
new user guide while keeping the old user guide as an orphan page.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This PR adds log file paths to the Train state schema, for both the
controller and workers. These will be user-facing application logs.

## Changes

- Added `controller_log_file_path` field to `TrainRun` and `log_file_path` to `TrainWorker` schema classes
- Added corresponding fields to protobuf messages for the state export API
- Implemented utility methods to retrieve log file paths for the controller and workers
   - Note that these should be called after the loggers are initialized
- Added functionality to assign log file paths to workers in the worker group

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>

## Why are these changes needed?

This PR moves the logic that checks whether files like `pyproject.toml`
are inside the working directory. The checks now run only if
`working_dir` is not explicitly set by the user. This ensures the
checks don't happen on the job submission codepath, since in that case
the `working_dir` could have been zipped up by the user on the client
side, so the files are not available in the runtime environment hook.


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: pcmoritz <pcmoritz@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…or not (#51158)


## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
  * It calls functions like [prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390) based on the OS and arch under the hood. It updates `/proc/pid/task/tid/comm` of the calling thread.
  * If the thread is the leader thread (i.e., the main thread in Python), `/proc/pid/comm` will also be updated.
  * `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be updated, based on my observation.
  * Note that `comm` has a length limitation (15 chars).

* `psutil.Process(pid).name()`:
  * https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * It tries to retrieve `/proc/pid/comm`.
  * If `/proc/pid/comm` is truncated, it attempts to retrieve the first element of `/proc/pid/cmdline`; otherwise, it returns `/proc/pid/comm`.
  * If `cmdline[0]` begins with the contents of `/proc/pid/comm`, it returns `cmdline[0]`; otherwise, it returns the truncated `/proc/pid/comm`.
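The fallback heuristic above can be sketched in pure Python (`resolve_process_name` is a hypothetical helper for illustration; the real logic lives inside psutil):

```python
# Sketch of psutil.Process.name()'s fallback heuristic described above.
# `resolve_process_name` is a hypothetical helper, not a psutil API.
COMM_MAX = 15  # /proc/<pid>/comm truncates names to 15 characters

def resolve_process_name(comm: str, cmdline0: str) -> str:
    # If comm looks truncated and cmdline[0] extends it, prefer cmdline[0];
    # otherwise fall back to the (possibly truncated) comm value.
    if len(comm) == COMM_MAX and cmdline0.startswith(comm):
        return cmdline0
    return comm

# The failure mode in this PR: setproctitle wrote "ray::hello" to cmdline
# from a non-main thread, but comm still holds the truncated actor title,
# so the task name is not found in the reported process name.
print(resolve_process_name("ray::VeryLongCo", "ray::hello"))  # ray::VeryLongCo
```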

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```




---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
…tructure (#51112)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Splits the tests for library-specific telemetry data into their own test
files. Each one has a similarly-structured test with a shared utility
for common logic.

This exercise surfaced many issues in the telemetry that we need to
address in follow-ups:

- #51163
- #51164
- #51165
- #51166
- #51167
- #51168

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…lease tests (#51176)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Removing dependencies on libraries from core.

There is nothing specific to tune in this test case and the logic is
well-covered elsewhere. We should prefer to have more targeted
functional tests within core.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
… the trainer (#51093)

Currently, users can forget to pass in a `ScalingConfig`, which
results in a confusing error. Instead, create a default `ScalingConfig`
that defaults to a single CPU worker.
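A minimal sketch of this defaulting behavior (the `ScalingConfig` here is a simplified stand-in for `ray.train.ScalingConfig`, which has many more fields):

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for ray.train.ScalingConfig.
@dataclass
class ScalingConfig:
    num_workers: int = 1   # default: a single worker
    use_gpu: bool = False  # default: CPU only

def resolve_scaling_config(user_config: Optional[ScalingConfig]) -> ScalingConfig:
    # Instead of surfacing a confusing error when the user forgets the
    # config, fall back to a single CPU worker.
    return user_config if user_config is not None else ScalingConfig()

print(resolve_scaling_config(None))  # ScalingConfig(num_workers=1, use_gpu=False)
```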

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
AsyncDrainNode is unused

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Changing the default tensor serialization in compiled graphs. Also added
a comprehensive set of unit tests covering cases for torch.Tensor
serialization in both Ray core and compiled graphs.

## Related issue number

Related to issues:
  - #50134
  - #50452
Also related to #47742

---------

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…e`s (#51090)

Currently, a `GPUFuture` contains a recorded CUDA event. When the
`GPUFuture` is garbage-collected, its event is also garbage-collected,
at which point `cupy` destroys that CUDA event.

This is problematic because the event might be destroyed after other
CUDA resources. In particular, we found that the overlapping test in
`test_torch_tensor_dag.py` consistently prints an invalid CUDA memory
error during DAG teardown. Our hypothesis is that, for an event, CUDA
likely stores a pointer to the stream it was recorded on. Since the
CUDA streams are destroyed before the events, the stream

This PR fixes the issue by caching an actor's unresolved `GPUFuture`s in
its serialization context. After the `GPUFuture` has been waited on, its
event is manually destroyed. During teardown, the events inside all
unresolved `GPUFuture`s are destroyed before other CUDA resources.
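The fix can be sketched roughly as follows (names like `SerializationContext.mark_resolved` are illustrative, not Ray's actual API; `FakeEvent` stands in for a cupy CUDA event):

```python
class FakeEvent:
    """Stand-in for a cupy CUDA event."""
    def __init__(self):
        self.destroyed = False

    def destroy(self):  # stands in for explicit CUDA event destruction
        self.destroyed = True

class SerializationContext:
    """Caches an actor's unresolved futures so their events can be
    destroyed deterministically."""
    def __init__(self):
        self._unresolved = {}  # future_id -> event

    def register(self, future_id, event):
        self._unresolved[future_id] = event

    def mark_resolved(self, future_id):
        # Destroy the event as soon as the future has been waited on.
        self._unresolved.pop(future_id).destroy()

    def teardown(self):
        # Destroy events of all still-unresolved futures before other
        # CUDA resources (e.g. streams) are torn down.
        for event in self._unresolved.values():
            event.destroy()
        self._unresolved.clear()
```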

---------

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Co-authored-by: Weixin Deng <weixin@cs.washington.edu>
…`) (#51089)

As part of running cpplint on all `.cc`/`.h` files in
`src/ray/object_manager` and `src/ray/object_manager/test`.

## Related issue number

Main issue: #50583
Closes sub issue: #51084

---------

Signed-off-by: Elijah Melton <elimelt@uw.edu>
For Python/CPython, the GIL should be automatically released and
reacquired at every [switch
interval](https://docs.python.org/3/library/sys.html#sys.setswitchinterval).
However, Cython somehow interferes with the GIL, so when we try to call
`PyGILState_Release` to release a thread, the following error message is
displayed:

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/9f138859-ba9e-445a-94ee-02a5fa60f922"
/>

The error message shows that the current thread (which holds the GIL) is
the main thread, but we are trying to release another thread (i.e.
default executor). The [CPython source
code](https://github.com/python/cpython/blob/475f933ed8b1c9546f1b5497a2241140c7065b5f/Python/pystate.c#L2862)
shows that this means that the default executor doesn't hold GIL at that
time.

In this PR, we explicitly acquire the GIL when we call `PyGILState_Release`.

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Use smaller `:object_manager` targets as dependencies.

## Related issue number

Closes #51014

---------

Signed-off-by: Elijah Melton <elimelt@uw.edu>
We have an experimental tensorflow utils library that was moved out of
`rllib` and into core in 2019 (thread:
#5698).

The way it's written makes Ray depend on RLlib (circular dependency). We
could refactor this code to move the other related utils into Ray, but
as far as I can tell, outside of `rllib`, this code was only used in an
example from the documentation that was removed in 2022:
#24727 (some of the code was
missed, I'm removing it here).

We do not expose it in our documentation at all and it lives under
`experimental`, so should be safe to remove.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Breaking dependencies on libraries from core.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…51033)

Currently, the worker fires-and-forgets a disconnect message to the
raylet, then goes through its shutdown procedure and closes the socket.

However, in #50812 we want to add detection of unexpected disconnects
on the socket. With the current procedure, the worker may send the
disconnect and close its socket immediately, causing the raylet to
detect the unexpected disconnect before processing the graceful
disconnect message.

This PR adds a `DisconnectClientReply` that the raylet will send once it
has processed the disconnect message. The worker blocks until it
receives this reply, then proceeds with the disconnect procedure and
closes its socket.
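The handshake can be illustrated with a local socket pair standing in for the worker/raylet connection (message names are illustrative, not Ray's wire protocol):

```python
import socket

# Worker and raylet ends of the connection.
worker_side, raylet_side = socket.socketpair()

worker_side.sendall(b"DISCONNECT")        # worker: graceful disconnect message
msg = raylet_side.recv(1024)              # raylet: process the disconnect...
assert msg == b"DISCONNECT"
raylet_side.sendall(b"DISCONNECT_REPLY")  # ...then acknowledge once processed

reply = worker_side.recv(1024)            # worker: block until the reply
worker_side.close()                       # only now close the socket
print(reply)  # b'DISCONNECT_REPLY'
```

Because the worker only closes its end after seeing the reply, the raylet never observes an unexpected disconnect for a graceful shutdown.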

## Related issue number

Towards #49999

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Pre-merge passed on the last fix just because the tests were already
marked flaky, and not all of them were running on pre-merge because of
flakiness. Tests already marked flaky only run on post-merge.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Remove `python/ray/rllib/` directory during core tests to enforce
componentization.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…1192)

1. Currently, the output of `ray status -v` only includes information on
node types (i.e., group names in KubeRay) and Ray node IDs. However, it
is not easy to map a Ray node ID to the name of the corresponding Ray
Pod (i.e. instance id in Autoscaler).

<img width="496" alt="Screenshot 2025-03-08 at 11 50 43 PM"
src="https://github.com/user-attachments/assets/89c66096-88c2-47fb-80d6-08067c7b9d90"
/>

2. Refactor

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
This is a very old regression test; refactored it to test the
underlying core behavior directly.

Also removed a test that has been skipped for a long time.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Remove `python/ray/tune/` directory during core tests to enforce
componentization.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
srinathk10 and others added 30 commits March 24, 2025 10:05
## Why are these changes needed?

* Fix Ray Train release test Arg parsing.
* Bump up limit from 500K to 1M rows.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

Update the list of possible deployment statuses in the docs.


---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
#51645)

Flaky on Windows.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Enable a CI environment with writeable cgroupv2 in CI.

To use this environment for testing, you need to use the
`--privileged-container` flag in your test definitions.
To enable bazel to read/write `/sys/fs/cgroup` in your tests, you need
to use the `--build-type cgroup` flag.

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
## Why are these changes needed?

This PR removes a feature flag that controls whether the proxy should
use cached replica queue length values for routing. The FF was
[introduced](#42943) over a year
ago as a way for users to quickly switch back to the previous
implementation. It has been enabled by default for [over a
year](#43169) now and works as
expected, so let's remove it. Consequently, this PR also removes
`RAY_SERVE_ENABLE_STRICT_MAX_ONGOING_REQUESTS`, as it is always enabled
if `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE` is enabled.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

## Why are these changes needed?

The current Databricks integration in Ray Data requires providing the
Databricks host URL without the "https://" prefix. However, this creates
compatibility issues when using Ray Data alongside MLflow, as MLflow's
Databricks integration (which uses the same DATABRICKS_HOST environment
variable) expects the URL to include the "https://" prefix.
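A hypothetical normalization helper illustrating the compatibility fix (not the actual Ray Data code):

```python
# Hypothetical sketch: accept DATABRICKS_HOST with or without the
# "https://" prefix so Ray Data and MLflow can share the same
# environment variable.
def normalize_databricks_host(host: str) -> str:
    host = host.strip().rstrip("/")  # drop trailing slash for consistency
    if host.startswith("https://"):
        return host
    return "https://" + host

print(normalize_databricks_host("my-workspace.cloud.databricks.com"))
print(normalize_databricks_host("https://my-workspace.cloud.databricks.com/"))
# both print: https://my-workspace.cloud.databricks.com
```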

## Related issue number

Closes #49925 


---------

Signed-off-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
…51433)

## Why are these changes needed?

Update repartition on `target_num_rows_per_block`:
when `target_num_rows_per_block` is set, only Dataset blocks that are
larger than `target_num_rows_per_block` are repartitioned.
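A pure-Python sketch of that rule: only blocks larger than the target are split, smaller blocks are left untouched. (Illustrative only; Ray Data operates on Arrow blocks, not plain lists.)

```python
def repartition_blocks(blocks, target_num_rows_per_block):
    out = []
    for block in blocks:
        if len(block) <= target_num_rows_per_block:
            out.append(block)  # small enough: keep as-is
        else:
            # Split oversized blocks into chunks of at most the target size.
            for i in range(0, len(block), target_num_rows_per_block):
                out.append(block[i:i + target_num_rows_per_block])
    return out

blocks = [[1, 2], [3, 4, 5, 6, 7]]
print(repartition_blocks(blocks, 3))  # [[1, 2], [3, 4, 5], [6, 7]]
```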



---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
uv is much, much faster.

Also adds tests to enforce dependency integrity.

Signed-off-by: kevin <kevin@anyscale.com>
so that the "lint" tag can be used just for lints and will not be
abused as "always"

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…51668)

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

Signed-off-by: jukejian <jukejian@bytedance.com>
…or missing keys (#44769)


As stated in #44768, the current implementation of `multiget` based on
`np.searchsorted` does not check for missing keys. I added the required
checks and updated the unit test for this case.
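The pitfall can be illustrated with the stdlib `bisect` module, which behaves like `np.searchsorted`: a sorted-array lookup returns an insertion point even for absent keys, so presence must be checked explicitly. (The `multiget` below is an illustrative sketch, not the actual implementation.)

```python
import bisect

def multiget(sorted_keys, sorted_values, queries):
    results = []
    for q in queries:
        i = bisect.bisect_left(sorted_keys, q)  # like np.searchsorted
        # Without this check, a missing key would silently map to the
        # value at the insertion point (or index past the end).
        if i < len(sorted_keys) and sorted_keys[i] == q:
            results.append(sorted_values[i])
        else:
            results.append(None)  # missing key
    return results

keys = ["a", "c", "e"]
vals = [1, 2, 3]
print(multiget(keys, vals, ["a", "b", "e"]))  # [1, None, 3]
```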

## Related issue number

Closes #44768 

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>

## Why are these changes needed?
This PR adds async generator support to `flat_map`.
The implementation is similar to how #46129 handled async callable
classes for map_batches(), changes include:
* Generalize the logic in `_generate_transform_fn_for_async_flat_map` so
it can process both batches and rows
* Add test case for async `flat_map`
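In the same spirit, here is a minimal standalone sketch of driving an async generator UDF from synchronous code (hypothetical helper names, not Ray Data's internals):

```python
import asyncio

async def udf(row):
    # An async generator UDF: one input row yields multiple output rows.
    for i in range(row["n"]):
        yield {"value": row["id"] * 10 + i}

def apply_async_flat_map(rows, async_gen_fn):
    # Drive the async generator for each row and flatten the results.
    async def consume():
        out = []
        for row in rows:
            async for item in async_gen_fn(row):
                out.append(item)
        return out
    return asyncio.run(consume())

rows = [{"id": 1, "n": 2}, {"id": 2, "n": 1}]
print(apply_async_flat_map(rows, udf))
# [{'value': 10}, {'value': 11}, {'value': 20}]
```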

## Related issue number
#50329 

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Drice1999 <chenxh267@gmail.com>
cgroup tests access (read and write) a system directory, so they
shouldn't be executed in parallel.
For example, `bazel test //src/ray/common/cgroup/tests:all` should run
tests one by one.

---------

Signed-off-by: dentiny <dentinyhao@gmail.com>
…nals (#51582)

If a threaded actor receives exit signals twice, it will execute
[task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224)
twice. However, the second call to `task_receiver_->Stop()` will get
stuck forever when executing the
[releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py

# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
This PR fixes all variable shadowing in the core worker.

Variable shadowing is known to be error-prone; I personally just
spent 20 minutes debugging an issue caused by it.
Need to address all issues before
#51669

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Linkun Chen <github@lkchen.net>
When doing release benchmark comparison, I found the script compares
**full filepaths** rather than **release result filenames**.
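The fix amounts to keying results by basename so that logs from two different run directories can be matched up. A hypothetical sketch (the real script is `release/release_logs/compare_perf_metrics`):

```python
import os

def match_result_files(base_paths, new_paths):
    # Key by filename, not full path, so /logs/base_version/x.json and
    # /logs/new_version/x.json pair up correctly.
    new_by_name = {os.path.basename(p): p for p in new_paths}
    return {
        os.path.basename(p): (p, new_by_name[os.path.basename(p)])
        for p in base_paths
        if os.path.basename(p) in new_by_name
    }

base = ["/logs/base_version/microbenchmark.json"]
new = ["/logs/new_version/microbenchmark.json"]
print(match_result_files(base, new))
```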

Signed-off-by: dentiny <dentinyhao@gmail.com>
Resize is pretty inefficient, default-constructing all of these when
there are many args.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
While looking at a dashboard flame graph, `memory_full_info` was
actually visible even though it only runs every 5 seconds. Our two
calls mean we're getting the info twice, which is even more expensive,
and it can also result in inconsistent info between USS and RSS. We
also get an additional speedup by using `oneshot()` when getting CPU
percent as well. https://psutil.readthedocs.io/en/latest/#psutil.Process.oneshot
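The `oneshot()` pattern can be illustrated in pure Python: take one expensive snapshot and answer every query from it. (This `ProcessInfo` is a toy stand-in; psutil implements the idea as a context manager over per-process reads.)

```python
class ProcessInfo:
    """Toy stand-in for a psutil-like process handle."""
    def __init__(self):
        self.fetch_count = 0

    def _snapshot(self):
        # Stands in for an expensive per-process read (e.g. /proc).
        self.fetch_count += 1
        return {"uss": 100, "rss": 150, "cpu_percent": 3.5}

    def stats(self):
        # One fetch serves all fields: cheaper, and the values are
        # mutually consistent (taken at the same instant).
        snap = self._snapshot()
        return snap["uss"], snap["rss"], snap["cpu_percent"]

p = ProcessInfo()
uss, rss, cpu = p.stats()
print(p.fetch_count)  # 1 — not one fetch per field
```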

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
…50375)

Adds an easy-to-use abstraction that allows use of a flat map with
reader/writer locks without thinking about thread safety or correct use.

Also see discussion here for thread-safe gcs node manager
#50024 (comment).

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
These tests should be owned by core.

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
In the current implementation, we create multiple gRPC stubs but not
multiple TCP connections.

I did some benchmarking, which shows that multiple gRPC stubs do
improve performance, so we cannot simply remove them.
Baseline: #51656
Benchmark result:
```sh
ubuntu@hjiang-devbox-pg$ python3 /home/ubuntu/ray/release/release_logs/compare_perf_metrics /home/ubuntu/ray/release/release_logs/base_version /home/ubuntu/ray/release/release_logs/new_version
REGRESSION 14.66%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.969306035542166 to 5.094453847363354 (14.66%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 14.51%: tasks_per_second (THROUGHPUT) regresses from 229.36388671973262 to 196.07926002040125 (14.51%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 12.24%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.238079759644892 to 11.61748355563387 (12.24%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 6.99%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 16841.57415006188 to 15663.994106069527 (6.99%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 6.66%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.0308320360232734 to 4.695580178438465 (6.66%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 3.27%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8271.004793738302 to 8000.250850666334 (3.27%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.93%: client__tasks_and_put_batch (THROUGHPUT) regresses from 14039.49342189659 to 13628.021852828302 (2.93%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.57%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 7549.855764807742 to 7355.844212331484 (2.57%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.51%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 23671.11885393803 to 23076.0705989556 (2.51%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.48%: single_client_tasks_sync (THROUGHPUT) regresses from 952.0269029041373 to 928.4100648169424 (2.48%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.27%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2786.0434704559507 to 2722.7190863593105 (2.27%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.89%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.9186869126953 to 514.0347539070235 (1.89%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.48%: placement_group_create/removal (THROUGHPUT) regresses from 755.9036396341403 to 744.686404601542 (1.48%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.38%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1055.293613301056 to 1040.6856076865438 (1.38%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.21%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.963509372266668 to 0.9518418662213537 (1.21%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.00%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4998.683273892644 to 4948.940736644676 (1.00%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.95%: n_n_actor_calls_async (THROUGHPUT) regresses from 27023.396892676123 to 26767.1762953109 (0.95%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.94%: single_client_put_gigabytes (THROUGHPUT) regresses from 19.0889649635755 to 18.909259900586363 (0.94%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.72%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4642.895287623349 to 4609.41957458445 (0.72%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.71%: single_client_tasks_async (THROUGHPUT) regresses from 7909.367644778592 to 7852.882247928618 (0.71%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.37%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 10536.660461092333 to 10497.924384397613 (0.37%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.31%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1045.148463125327 to 1041.9241552653498 (0.31%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.14%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5161.152587049948 to 5154.097793024906 (0.14%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.08%: client__put_calls (THROUGHPUT) regresses from 774.1865536687783 to 773.580427400101 (0.08%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 29.16%: dashboard_p95_latency_ms (LATENCY) regresses from 26.737 to 34.534 (29.16%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 28.52%: dashboard_p99_latency_ms (LATENCY) regresses from 3058.247 to 3930.337 (28.52%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_actors.json
REGRESSION 16.62%: stage_3_creation_time (LATENCY) regresses from 1.8721938133239746 to 2.183365821838379 (16.62%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 11.56%: dashboard_p99_latency_ms (LATENCY) regresses from 277.091 to 309.116 (11.56%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_pgs.json
REGRESSION 9.37%: dashboard_p99_latency_ms (LATENCY) regresses from 77.557 to 84.822 (9.37%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 5.97%: avg_pg_remove_time_ms (LATENCY) regresses from 1.3214836546547326 to 1.4004240105108736 (5.97%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_placement_group.json
REGRESSION 5.43%: dashboard_p50_latency_ms (LATENCY) regresses from 5.399 to 5.692 (5.43%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 3.73%: avg_iteration_time (LATENCY) regresses from 1.1903677296638489 to 1.234775812625885 (3.73%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_dead_actors.json
REGRESSION 1.09%: stage_1_avg_iteration_time (LATENCY) regresses from 12.530043482780457 to 12.667147731781006 (1.09%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.48%: stage_2_avg_iteration_time (LATENCY) regresses from 38.180511474609375 to 38.36462712287903 (0.48%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.32%: stage_3_time (LATENCY) regresses from 1884.1879193782806 to 1890.2630755901337 (0.32%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.05%: 10000_args_time (LATENCY) regresses from 19.075126932999993 to 19.085101717 (0.05%) in /home/ubuntu/ray/release/release_logs/new_version/scalability/single_node.json
```

Reproduction commands:
```sh
BUILDKITE_TOKEN=bkua_77ebf3852a34e264e48619dab87574f634e9e4f7 python3 /home/ubuntu/ray/release/release_logs/fetch_release_logs.py new_version e00693c dentiny:hjiang/fix-client-conn

BUILDKITE_TOKEN=bkua_77ebf3852a34e264e48619dab87574f634e9e4f7 python3 /home/ubuntu/ray/release/release_logs/fetch_release_logs.py base_version b92dcd1 dentiny:hjiang/baseline-release-benchmark

python3 /home/ubuntu/ray/release/release_logs/compare_perf_metrics /home/ubuntu/ray/release/release_logs/base_version /home/ubuntu/ray/release/release_logs/new_version
```

Signed-off-by: dentiny <dentinyhao@gmail.com>
…rd fail (#51693)

keep things in sync and intact

Signed-off-by: kevin <kevin@anyscale.com>