[pull] master from ray-project:master #140

Open · wants to merge 6,529 commits into base: master
Conversation

@pull pull bot commented Jun 29, 2023

See Commits and Changes for more details.

Created by pull[bot]

edoakes and others added 28 commits March 7, 2025 16:16
More work towards breaking the core -> libraries dependencies.

- `test_output.py`: changed to use the Ray core API instead.
- `test_client_library_integration.py`: moved to a test inside the rllib directory.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Train v2 has a new way of doing Train + Tune interop. This PR adds the
new user guide while keeping the old user guide as an orphan page.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This PR adds log file paths to the Train state schema, for both the
controller and workers. These will be user-facing application logs.

## Changes

- Added `controller_log_file_path` field to `TrainRun` and `log_file_path` to `TrainWorker` schema classes
- Added corresponding fields to protobuf messages for the state export API
- Implemented utility methods to retrieve log file paths for the controller and workers
   - Note that these should be called after the loggers are initialized
- Added functionality to assign log file paths to workers in the worker group

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>

## Why are these changes needed?

This PR moves the logic that checks whether files like `pyproject.toml`
are inside the working directory. The checks now run only if
`working_dir` is not explicitly set by the user. This ensures the
checks don't happen on the job submission codepath, since in that case
the `working_dir` could have been zipped up by the user on the client
side, so the files are not available in the runtime environment hook.


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: pcmoritz <pcmoritz@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…or not (#51158)


## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
  * It calls functions like [prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390) based on the OS and arch under the hood. It updates `/proc/pid/task/tid/comm` of the calling thread.
  * If the thread is the leader thread (i.e., the main thread in Python), `/proc/pid/comm` will also be updated.
  * `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be updated, based on my observation.
  * Note that `comm` has a length limitation (15 chars).

* `psutil.Process(pid).name()`:
  * https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * It tries to retrieve `/proc/pid/comm`.
  * If `/proc/pid/comm` is truncated, it attempts to retrieve the first element of `/proc/pid/cmdline`; otherwise, it returns `/proc/pid/comm`.
  * If `cmdline[0]` begins with the contents of `/proc/pid/comm`, it returns `cmdline[0]`; otherwise, it returns the truncated `/proc/pid/comm`.
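The fallback heuristic above can be sketched in pure Python (`resolve_process_name` is a hypothetical helper for illustration; the real logic lives inside psutil):

```python
# Sketch of psutil.Process.name()'s fallback heuristic described above.
# `resolve_process_name` is a hypothetical helper, not a psutil API.
COMM_MAX = 15  # /proc/<pid>/comm truncates names to 15 characters

def resolve_process_name(comm: str, cmdline0: str) -> str:
    # If comm looks truncated and cmdline[0] extends it, prefer cmdline[0];
    # otherwise fall back to the (possibly truncated) comm value.
    if len(comm) == COMM_MAX and cmdline0.startswith(comm):
        return cmdline0
    return comm

# The failure mode in this PR: setproctitle wrote "ray::hello" to cmdline
# from a non-main thread, but comm still holds the truncated actor title,
# so the task name is not found in the reported process name.
print(resolve_process_name("ray::VeryLongCo", "ray::hello"))  # ray::VeryLongCo
```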

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```




---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
…tructure (#51112)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Splits the tests for library-specific telemetry data into their own test
files. Each one has a similarly-structured test with a shared utility
for common logic.

This exercise surfaced many issues in the telemetry that we need to
address in follow-ups:

- #51163
- #51164
- #51165
- #51166
- #51167
- #51168

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…lease tests (#51176)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Removing dependencies on libraries from core.

There is nothing specific to tune in this test case and the logic is
well-covered elsewhere. We should prefer to have more targeted
functional tests within core.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
… the trainer (#51093)

Currently, users can forget to pass in a `ScalingConfig`, which
results in a confusing error. Instead, create a default `ScalingConfig`
that defaults to a single CPU worker.
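A minimal sketch of this defaulting behavior (the `ScalingConfig` here is a simplified stand-in for `ray.train.ScalingConfig`, which has many more fields):

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for ray.train.ScalingConfig.
@dataclass
class ScalingConfig:
    num_workers: int = 1   # default: a single worker
    use_gpu: bool = False  # default: CPU only

def resolve_scaling_config(user_config: Optional[ScalingConfig]) -> ScalingConfig:
    # Instead of surfacing a confusing error when the user forgets the
    # config, fall back to a single CPU worker.
    return user_config if user_config is not None else ScalingConfig()

print(resolve_scaling_config(None))  # ScalingConfig(num_workers=1, use_gpu=False)
```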

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
AsyncDrainNode is unused

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Changing the default tensor serialization in compiled graphs. Also added
a comprehensive set of unit tests covering cases for torch.Tensor
serialization in both Ray core and compiled graphs.

## Related issue number

Related to issues:
  - #50134
  - #50452
Also related to #47742

---------

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…e`s (#51090)

Currently, a `GPUFuture` contains a recorded CUDA event. When the
`GPUFuture` is garbage-collected, its event is also garbage-collected,
at which point `cupy` destroys that CUDA event.

This is problematic because the event might be destroyed after other
CUDA resources. In particular, we found that the overlapping test in
`test_torch_tensor_dag.py` consistently prints an invalid CUDA memory
error during DAG teardown. Our hypothesis is that, for an event, CUDA
likely stores a pointer to the stream it was recorded on. Since the
CUDA streams are destroyed before the events, the stream

This PR fixes the issue by caching an actor's unresolved `GPUFuture`s in
its serialization context. After the `GPUFuture` has been waited on, its
event is manually destroyed. During teardown, the events inside all
unresolved `GPUFuture`s are destroyed before other CUDA resources.
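The fix can be sketched roughly as follows (names like `SerializationContext.mark_resolved` are illustrative, not Ray's actual API; `FakeEvent` stands in for a cupy CUDA event):

```python
class FakeEvent:
    """Stand-in for a cupy CUDA event."""
    def __init__(self):
        self.destroyed = False

    def destroy(self):  # stands in for explicit CUDA event destruction
        self.destroyed = True

class SerializationContext:
    """Caches an actor's unresolved futures so their events can be
    destroyed deterministically."""
    def __init__(self):
        self._unresolved = {}  # future_id -> event

    def register(self, future_id, event):
        self._unresolved[future_id] = event

    def mark_resolved(self, future_id):
        # Destroy the event as soon as the future has been waited on.
        self._unresolved.pop(future_id).destroy()

    def teardown(self):
        # Destroy events of all still-unresolved futures before other
        # CUDA resources (e.g. streams) are torn down.
        for event in self._unresolved.values():
            event.destroy()
        self._unresolved.clear()
```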

---------

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Co-authored-by: Weixin Deng <weixin@cs.washington.edu>
…`) (#51089)

As part of running cpplint on all `.cc`/`.h` files in
`src/ray/object_manager` and `src/ray/object_manager/test`.

## Related issue number

Main issue: #50583
Closes sub issue: #51084

---------

Signed-off-by: Elijah Melton <elimelt@uw.edu>
For Python/CPython, the GIL should be automatically released and
reacquired at every [switch
interval](https://docs.python.org/3/library/sys.html#sys.setswitchinterval).
However, Cython somehow interferes with the GIL, so when we try to call
`PyGILState_Release` to release a thread, the following error message is
displayed:

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/9f138859-ba9e-445a-94ee-02a5fa60f922"
/>

The error message shows that the current thread (which holds the GIL) is
the main thread, but we are trying to release another thread (i.e.
default executor). The [CPython source
code](https://github.com/python/cpython/blob/475f933ed8b1c9546f1b5497a2241140c7065b5f/Python/pystate.c#L2862)
shows that this means that the default executor doesn't hold GIL at that
time.

In this PR, we explicitly acquire the GIL when we call `PyGILState_Release`.

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Use smaller `:object_manager` targets as dependencies.

## Related issue number

Closes #51014

---------

Signed-off-by: Elijah Melton <elimelt@uw.edu>
We have an experimental tensorflow utils library that was moved out of
`rllib` and into core in 2019 (thread:
#5698).

The way it's written makes Ray depend on RLlib (circular dependency). We
could refactor this code to move the other related utils into Ray, but
as far as I can tell, outside of `rllib`, this code was only used in an
example from the documentation that was removed in 2022:
#24727 (some of the code was
missed, I'm removing it here).

We do not expose it in our documentation at all and it lives under
`experimental`, so should be safe to remove.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Breaking dependencies on libraries from core.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…51033)

Currently, the worker fires-and-forgets a disconnect message to the
raylet, then goes through its shutdown procedure and closes the socket.

However, in #50812 we want to add detection of unexpected disconnects
on the socket. With the current procedure, the worker may send the
disconnect and close its socket immediately, causing the raylet to
detect the unexpected disconnect before processing the graceful
disconnect message.

This PR adds a `DisconnectClientReply` that the raylet will send once it
has processed the disconnect message. The worker blocks until it
receives this reply, then proceeds with the disconnect procedure and
closes its socket.
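The handshake can be illustrated with a local socket pair standing in for the worker/raylet connection (message names are illustrative, not Ray's wire protocol):

```python
import socket

# Worker and raylet ends of the connection.
worker_side, raylet_side = socket.socketpair()

worker_side.sendall(b"DISCONNECT")        # worker: graceful disconnect message
msg = raylet_side.recv(1024)              # raylet: process the disconnect...
assert msg == b"DISCONNECT"
raylet_side.sendall(b"DISCONNECT_REPLY")  # ...then acknowledge once processed

reply = worker_side.recv(1024)            # worker: block until the reply
worker_side.close()                       # only now close the socket
print(reply)  # b'DISCONNECT_REPLY'
```

Because the worker only closes its end after seeing the reply, the raylet never observes an unexpected disconnect for a graceful shutdown.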

## Related issue number

Towards #49999

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Pre-merge passed on the last fix just because the tests were already
marked flaky, and not all of them were running on pre-merge because of
flakiness. Tests already marked flaky only run on post-merge.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Remove `python/ray/rllib/` directory during core tests to enforce
componentization.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…1192)

1. Currently, the output of `ray status -v` only includes information on
node types (i.e., group names in KubeRay) and Ray node IDs. However, it
is not easy to map a Ray node ID to the name of the corresponding Ray
Pod (i.e. instance id in Autoscaler).

<img width="496" alt="Screenshot 2025-03-08 at 11 50 43 PM"
src="https://github.com/user-attachments/assets/89c66096-88c2-47fb-80d6-08067c7b9d90"
/>

2. Refactor

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
This is a very old regression test; refactored it to test the
underlying core behavior directly.

Also removed a test that has been skipped for a long time.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Remove `python/ray/tune/` directory during core tests to enforce
componentization.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
srinathk10 and others added 30 commits March 24, 2025 10:05
## Why are these changes needed?

* Fix Ray Train release test Arg parsing.
* Bump up limit from 500K to 1M rows.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
## Why are these changes needed?

Update the list of possible deployment statuses in the docs.


---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
#51645)

Flaky on Windows.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Enable a CI environment with writeable cgroupv2 in CI.

To use this environment for testing, you need to use the
`--privileged-container` flag in your test definitions.
To enable bazel to read/write `/sys/fs/cgroup` in your tests, you need
to use the `--build-type cgroup` flag.

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
## Why are these changes needed?

This PR removes a feature flag that controls whether the proxy should
use cached replica queue length values for routing. The FF was
[introduced](#42943) over a year
ago as a way for users to quickly switch back to the previous
implementation. It has been enabled by default for [over a
year](#43169) now and works as
expected, so let's remove it. Consequently, this PR also removes
`RAY_SERVE_ENABLE_STRICT_MAX_ONGOING_REQUESTS`, as it is always enabled
if `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE` is enabled.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

## Why are these changes needed?

The current Databricks integration in Ray Data requires providing the
Databricks host URL without the "https://" prefix. However, this creates
compatibility issues when using Ray Data alongside MLflow, as MLflow's
Databricks integration (which uses the same DATABRICKS_HOST environment
variable) expects the URL to include the "https://" prefix.
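A hypothetical normalization helper illustrating the compatibility fix (not the actual Ray Data code):

```python
# Hypothetical sketch: accept DATABRICKS_HOST with or without the
# "https://" prefix so Ray Data and MLflow can share the same
# environment variable.
def normalize_databricks_host(host: str) -> str:
    host = host.strip().rstrip("/")  # drop trailing slash for consistency
    if host.startswith("https://"):
        return host
    return "https://" + host

print(normalize_databricks_host("my-workspace.cloud.databricks.com"))
print(normalize_databricks_host("https://my-workspace.cloud.databricks.com/"))
# both print: https://my-workspace.cloud.databricks.com
```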

## Related issue number

Closes #49925 


---------

Signed-off-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Gil Leibovitz <gil.leibovitz@doubleverify.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
…51433)

## Why are these changes needed?

Update repartition on `target_num_rows_per_block`:
when `target_num_rows_per_block` is set, only Dataset blocks that are
larger than `target_num_rows_per_block` are repartitioned.
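A pure-Python sketch of that rule: only blocks larger than the target are split, smaller blocks are left untouched. (Illustrative only; Ray Data operates on Arrow blocks, not plain lists.)

```python
def repartition_blocks(blocks, target_num_rows_per_block):
    out = []
    for block in blocks:
        if len(block) <= target_num_rows_per_block:
            out.append(block)  # small enough: keep as-is
        else:
            # Split oversized blocks into chunks of at most the target size.
            for i in range(0, len(block), target_num_rows_per_block):
                out.append(block[i:i + target_num_rows_per_block])
    return out

blocks = [[1, 2], [3, 4, 5, 6, 7]]
print(repartition_blocks(blocks, 3))  # [[1, 2], [3, 4, 5], [6, 7]]
```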



---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
uv is much, much faster.

Also adds tests to enforce dependency integrity.

Signed-off-by: kevin <kevin@anyscale.com>
so that the "lint" tag can be used just for lints and will not be
abused as "always"

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…51668)

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>

Signed-off-by: jukejian <jukejian@bytedance.com>
…or missing keys (#44769)


As stated in #44768, the current implementation of `multiget` based on
`np.searchsorted` does not check for missing keys. I added the required
checks and updated the unit test for this case.
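The pitfall can be illustrated with the stdlib `bisect` module, which behaves like `np.searchsorted`: a sorted-array lookup returns an insertion point even for absent keys, so presence must be checked explicitly. (The `multiget` below is an illustrative sketch, not the actual implementation.)

```python
import bisect

def multiget(sorted_keys, sorted_values, queries):
    results = []
    for q in queries:
        i = bisect.bisect_left(sorted_keys, q)  # like np.searchsorted
        # Without this check, a missing key would silently map to the
        # value at the insertion point (or index past the end).
        if i < len(sorted_keys) and sorted_keys[i] == q:
            results.append(sorted_values[i])
        else:
            results.append(None)  # missing key
    return results

keys = ["a", "c", "e"]
vals = [1, 2, 3]
print(multiget(keys, vals, ["a", "b", "e"]))  # [1, None, 3]
```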

## Related issue number

Closes #44768 

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>

## Why are these changes needed?
This PR adds async generator support to `flat_map`.
The implementation is similar to how #46129 handled async callable
classes for map_batches(), changes include:
* Generalize the logic in `_generate_transform_fn_for_async_flat_map` so
it can process both batches and rows
* Add test case for async `flat_map`
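In the same spirit, here is a minimal standalone sketch of driving an async generator UDF from synchronous code (hypothetical helper names, not Ray Data's internals):

```python
import asyncio

async def udf(row):
    # An async generator UDF: one input row yields multiple output rows.
    for i in range(row["n"]):
        yield {"value": row["id"] * 10 + i}

def apply_async_flat_map(rows, async_gen_fn):
    # Drive the async generator for each row and flatten the results.
    async def consume():
        out = []
        for row in rows:
            async for item in async_gen_fn(row):
                out.append(item)
        return out
    return asyncio.run(consume())

rows = [{"id": 1, "n": 2}, {"id": 2, "n": 1}]
print(apply_async_flat_map(rows, udf))
# [{'value': 10}, {'value': 11}, {'value': 20}]
```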

## Related issue number
#50329 

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Drice1999 <chenxh267@gmail.com>
cgroup tests access (read and write) a system directory, so they
shouldn't be executed in parallel.
For example, `bazel test //src/ray/common/cgroup/tests:all` should run
tests one by one.

---------

Signed-off-by: dentiny <dentinyhao@gmail.com>
…nals (#51582)

If a threaded actor receives exit signals twice, it will execute
[task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224)
twice. However, the second call to `task_receiver_->Stop()` will get
stuck forever when executing the
[releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py

# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
This PR fixes all variable shadowing in the core worker.

Variable shadowing is known to be error-prone; I personally just
spent 20 minutes debugging an issue caused by it.
Need to address all issues before
#51669

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Linkun Chen <github@lkchen.net>
When doing release benchmark comparison, I found the script compares
**full filepaths** rather than **release result filenames**.
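The fix amounts to keying results by basename so that logs from two different run directories can be matched up. A hypothetical sketch (the real script is `release/release_logs/compare_perf_metrics`):

```python
import os

def match_result_files(base_paths, new_paths):
    # Key by filename, not full path, so /logs/base_version/x.json and
    # /logs/new_version/x.json pair up correctly.
    new_by_name = {os.path.basename(p): p for p in new_paths}
    return {
        os.path.basename(p): (p, new_by_name[os.path.basename(p)])
        for p in base_paths
        if os.path.basename(p) in new_by_name
    }

base = ["/logs/base_version/microbenchmark.json"]
new = ["/logs/new_version/microbenchmark.json"]
print(match_result_files(base, new))
```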

Signed-off-by: dentiny <dentinyhao@gmail.com>
Resize is pretty inefficient, default-constructing all of these when
there are many args.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
While looking at a dashboard flame graph, `memory_full_info` was
actually visible even though it only runs every 5 seconds. Our two
calls mean we're getting the info twice, which is even more expensive,
and it can also result in inconsistent info between USS and RSS. We
also get an additional speedup by using `oneshot()` when getting CPU
percent as well. https://psutil.readthedocs.io/en/latest/#psutil.Process.oneshot
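The `oneshot()` pattern can be illustrated in pure Python: take one expensive snapshot and answer every query from it. (This `ProcessInfo` is a toy stand-in; psutil implements the idea as a context manager over per-process reads.)

```python
class ProcessInfo:
    """Toy stand-in for a psutil-like process handle."""
    def __init__(self):
        self.fetch_count = 0

    def _snapshot(self):
        # Stands in for an expensive per-process read (e.g. /proc).
        self.fetch_count += 1
        return {"uss": 100, "rss": 150, "cpu_percent": 3.5}

    def stats(self):
        # One fetch serves all fields: cheaper, and the values are
        # mutually consistent (taken at the same instant).
        snap = self._snapshot()
        return snap["uss"], snap["rss"], snap["cpu_percent"]

p = ProcessInfo()
uss, rss, cpu = p.stats()
print(p.fetch_count)  # 1 — not one fetch per field
```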

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
…50375)

Adds an easy-to-use abstraction that allows use of a flat map with
reader/writer locks without thinking about thread safety or correct use.

Also see discussion here for thread-safe gcs node manager
#50024 (comment).

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
These tests should be owned by core.

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
In the current implementation, we create multiple gRPC stubs but not
multiple TCP connections.

I did some benchmarking, which shows that multiple gRPC stubs do
improve performance, so we cannot simply remove them.
Baseline: #51656
Benchmark result:
```sh
ubuntu@hjiang-devbox-pg$ python3 /home/ubuntu/ray/release/release_logs/compare_perf_metrics /home/ubuntu/ray/release/release_logs/base_version /home/ubuntu/ray/release/release_logs/new_version
REGRESSION 14.66%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.969306035542166 to 5.094453847363354 (14.66%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 14.51%: tasks_per_second (THROUGHPUT) regresses from 229.36388671973262 to 196.07926002040125 (14.51%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 12.24%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.238079759644892 to 11.61748355563387 (12.24%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 6.99%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 16841.57415006188 to 15663.994106069527 (6.99%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 6.66%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.0308320360232734 to 4.695580178438465 (6.66%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 3.27%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8271.004793738302 to 8000.250850666334 (3.27%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.93%: client__tasks_and_put_batch (THROUGHPUT) regresses from 14039.49342189659 to 13628.021852828302 (2.93%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.57%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 7549.855764807742 to 7355.844212331484 (2.57%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.51%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 23671.11885393803 to 23076.0705989556 (2.51%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.48%: single_client_tasks_sync (THROUGHPUT) regresses from 952.0269029041373 to 928.4100648169424 (2.48%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.27%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2786.0434704559507 to 2722.7190863593105 (2.27%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.89%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.9186869126953 to 514.0347539070235 (1.89%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.48%: placement_group_create/removal (THROUGHPUT) regresses from 755.9036396341403 to 744.686404601542 (1.48%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.38%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1055.293613301056 to 1040.6856076865438 (1.38%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.21%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.963509372266668 to 0.9518418662213537 (1.21%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.00%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4998.683273892644 to 4948.940736644676 (1.00%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.95%: n_n_actor_calls_async (THROUGHPUT) regresses from 27023.396892676123 to 26767.1762953109 (0.95%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.94%: single_client_put_gigabytes (THROUGHPUT) regresses from 19.0889649635755 to 18.909259900586363 (0.94%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.72%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4642.895287623349 to 4609.41957458445 (0.72%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.71%: single_client_tasks_async (THROUGHPUT) regresses from 7909.367644778592 to 7852.882247928618 (0.71%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.37%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 10536.660461092333 to 10497.924384397613 (0.37%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.31%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1045.148463125327 to 1041.9241552653498 (0.31%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.14%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5161.152587049948 to 5154.097793024906 (0.14%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.08%: client__put_calls (THROUGHPUT) regresses from 774.1865536687783 to 773.580427400101 (0.08%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 29.16%: dashboard_p95_latency_ms (LATENCY) regresses from 26.737 to 34.534 (29.16%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 28.52%: dashboard_p99_latency_ms (LATENCY) regresses from 3058.247 to 3930.337 (28.52%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_actors.json
REGRESSION 16.62%: stage_3_creation_time (LATENCY) regresses from 1.8721938133239746 to 2.183365821838379 (16.62%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 11.56%: dashboard_p99_latency_ms (LATENCY) regresses from 277.091 to 309.116 (11.56%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_pgs.json
REGRESSION 9.37%: dashboard_p99_latency_ms (LATENCY) regresses from 77.557 to 84.822 (9.37%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 5.97%: avg_pg_remove_time_ms (LATENCY) regresses from 1.3214836546547326 to 1.4004240105108736 (5.97%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_placement_group.json
REGRESSION 5.43%: dashboard_p50_latency_ms (LATENCY) regresses from 5.399 to 5.692 (5.43%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 3.73%: avg_iteration_time (LATENCY) regresses from 1.1903677296638489 to 1.234775812625885 (3.73%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_dead_actors.json
REGRESSION 1.09%: stage_1_avg_iteration_time (LATENCY) regresses from 12.530043482780457 to 12.667147731781006 (1.09%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.48%: stage_2_avg_iteration_time (LATENCY) regresses from 38.180511474609375 to 38.36462712287903 (0.48%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.32%: stage_3_time (LATENCY) regresses from 1884.1879193782806 to 1890.2630755901337 (0.32%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.05%: 10000_args_time (LATENCY) regresses from 19.075126932999993 to 19.085101717 (0.05%) in /home/ubuntu/ray/release/release_logs/new_version/scalability/single_node.json
```

Reproduction commands:
```sh
BUILDKITE_TOKEN=bkua_77ebf3852a34e264e48619dab87574f634e9e4f7 python3 /home/ubuntu/ray/release/release_logs/fetch_release_logs.py new_version e00693c dentiny:hjiang/fix-client-conn

BUILDKITE_TOKEN=bkua_77ebf3852a34e264e48619dab87574f634e9e4f7 python3 /home/ubuntu/ray/release/release_logs/fetch_release_logs.py base_version b92dcd1 dentiny:hjiang/baseline-release-benchmark

python3 /home/ubuntu/ray/release/release_logs/compare_perf_metrics /home/ubuntu/ray/release/release_logs/base_version /home/ubuntu/ray/release/release_logs/new_version
```

Signed-off-by: dentiny <dentinyhao@gmail.com>
…rd fail (#51693)

keep things in sync and intact

Signed-off-by: kevin <kevin@anyscale.com>