[pull] master from ray-project:master #140
Open: pull wants to merge 6,529 commits into garymm:master from ray-project:master
More work towards breaking the core -> libraries dependencies.
- `test_output.py`: changed to use the Ray core API instead.
- `test_client_library_integration.py`: moved to a test inside the rllib directory.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Train v2 has a new way of doing Train + Tune interop. This PR adds the new user guide while keeping the old user guide as an orphan page. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This PR adds log file paths to the Train state schema, for both the controller and workers. These will be user-facing application logs.

## Changes
- Added a `controller_log_file_path` field to `TrainRun` and a `log_file_path` field to `TrainWorker` schema classes
- Added corresponding fields to protobuf messages for the state export API
- Implemented utility methods to retrieve log file paths for the controller and workers
  - Note that these should be called after the loggers are initialized
- Added functionality to assign log file paths to workers in the worker group

Signed-off-by: Matthew Deng <matt@anyscale.com>
## Why are these changes needed?

This PR moves the logic that checks whether the pyproject.toml etc. are inside the working directory. The checks are now only done if the working_dir is not explicitly set by the user. This makes sure the checks do not happen on the job submission codepath, since in that case the working_dir could have been zipped up by the user / on the client side, so the files are not available in the runtime environment hook.

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: pcmoritz <pcmoritz@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…or not (#51158)

## Why are these changes needed?

Ray core worker processes change the process title by calling `setproctitle` before and after executing tasks. Then, `verify_tasks_running_or_terminated` uses the following logic to check whether a Ray actor task is running:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the issue, we need to understand the behavior of `setproctitle` and `psutil.Process(pid).name()`.

* `setproctitle`:
  * It calls functions like [prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390) based on the OS and arch under the hood. It updates `/proc/pid/task/tid/comm` of the calling thread.
  * If the thread is the leader thread (i.e. the main thread in Python), `/proc/pid/comm` will also be updated.
  * `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be updated, based on my observation.
  * Note that `comm` has a length limitation (15 chars).
* `psutil.Process(pid).name()`:
  * https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Tries to retrieve `/proc/pid/comm`.
  * If `/proc/pid/comm` is truncated, attempts to retrieve the first element of `/proc/pid/cmdline`; otherwise, returns `/proc/pid/comm`.
  * If `cmdline[0]` begins with the contents of `/proc/pid/comm`, returns `cmdline[0]`; otherwise, returns the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case `setproctitle` will not update `/proc/pid/comm`), and if users pass an `options.name` to the actor task that doesn't start with the contents of `/proc/pid/comm`, then `task_name in psutil.Process(pid).name()` will evaluate to false, which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return
    `cmdline()[0]` because `cmdline()[0]` doesn't start with
    "ray::VeryLongCo". This is the implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```

Signed-off-by: kaihsun <kaihsun@anyscale.com>
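The name-resolution heuristic described above can be sketched in plain Python. This is a simplified model, not psutil's actual implementation: `comm` and `cmdline` stand in for the contents of the `/proc/<pid>/comm` and `/proc/<pid>/cmdline` files.

```python
def resolve_process_name(comm: str, cmdline: list) -> str:
    """Simplified model of psutil.Process.name() on Linux.

    comm is limited to 15 characters by the kernel; cmdline is the
    process argv. If comm looks truncated and cmdline[0] extends it,
    prefer cmdline[0]; otherwise return the (possibly truncated) comm.
    """
    if len(comm) == 15 and cmdline and cmdline[0].startswith(comm):
        return cmdline[0]
    return comm

# A task title set by a non-main thread only updates cmdline, not comm.
# Since "ray::hello" does not start with the truncated comm, the task
# name is invisible to the name() heuristic:
comm = "ray::VeryLongCo"      # 15-char truncation of "ray::VeryLongCommActor"
cmdline = ["ray::hello"]      # title set by setproctitle on a worker thread
assert resolve_process_name(comm, cmdline) == "ray::VeryLongCo"
assert "hello" not in resolve_process_name(comm, cmdline)
```

This mirrors why `task_name in psutil.Process(pid).name()` evaluates to false for non-main-thread actor tasks with custom names.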
…tructure (#51112) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Splits the tests for library-specific telemetry data into their own test files. Each one has a similarly-structured test with a shared utility for common logic. This exercise surfaced many issues in the telemetry that we need to address in follow-ups:
- #51163
- #51164
- #51165
- #51166
- #51167
- #51168

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…lease tests (#51176) --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Removing dependencies on libraries from core. There is nothing specific to tune in this test case and the logic is well-covered elsewhere. We should prefer to have more targeted functional tests within core. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
… the trainer (#51093) Users can currently forget to pass in a `ScalingConfig`, which results in a confusing error. Instead, create a default `ScalingConfig` that defaults to a single CPU worker. Signed-off-by: Justin Yu <justinvyu@anyscale.com>
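The fallback pattern can be sketched as follows; the class below is a hypothetical minimal stand-in, not Ray Train's actual `ScalingConfig` definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingConfig:
    # Hypothetical stand-in for Ray Train's ScalingConfig.
    num_workers: int = 1   # default: a single worker
    use_gpu: bool = False  # default: CPU only


def resolve_scaling_config(user_config: Optional[ScalingConfig]) -> ScalingConfig:
    # Instead of raising a confusing error when the user forgets the
    # config, fall back to a single CPU worker.
    return user_config if user_config is not None else ScalingConfig()


assert resolve_scaling_config(None) == ScalingConfig(num_workers=1, use_gpu=False)
assert resolve_scaling_config(ScalingConfig(num_workers=4)).num_workers == 4
```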
Signed-off-by: kaihsun <kaihsun@anyscale.com>
AsyncDrainNode is unused --------- Signed-off-by: dayshah <dhyey2019@gmail.com>
Changing the default tensor serialization in compiled graphs. Also added a comprehensive set of unit tests covering cases for torch.Tensor serialization in both Ray core and compiled graphs. ## Related issue number Related to issues: - #50134 - #50452 Also related to #47742 --------- Signed-off-by: Amjad Almahairi <anm@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…e`s (#51090) Currently, a `GPUFuture` contains a recorded CUDA event. When the `GPUFuture` is garbage-collected, its event is also garbage-collected, at which point `cupy` destroys that CUDA event. This is problematic because the event might be destroyed after other CUDA resources. In particular, we found that the overlapping test in `test_torch_tensor_dag.py` consistently prints out an invalid cuda memory error that occurs during dag teardown. Our hypothesis is that: For an event, CUDA likely stores a pointer to the stream it recorded. Since the CUDA streams are destroyed before the events, the stream pointer is no longer valid when the event is destroyed. This PR fixes the issue by caching an actor's unresolved `GPUFuture`s in its serialization context. After the `GPUFuture` has been waited on, its event is manually destroyed. During teardown, the events inside all unresolved `GPUFuture`s are destroyed before other CUDA resources. --------- Signed-off-by: Yuhan Ruan <andyubryh@gmail.com> Signed-off-by: Weixin Deng <weixin@cs.washington.edu> Co-authored-by: Weixin Deng <weixin@cs.washington.edu>
For Python/CPython, the GIL should be automatically released and reacquired at every [switch interval](https://docs.python.org/3/library/sys.html#sys.setswitchinterval). However, Cython somehow interferes with the GIL, so when we try to call `PyGILState_Release` to release a thread, a fatal error is displayed. [Screenshot of the error message omitted.] The error message shows that the current thread (which holds the GIL) is the main thread, but we are trying to release another thread (i.e. the default executor). The [CPython source code](https://github.com/python/cpython/blob/475f933ed8b1c9546f1b5497a2241140c7065b5f/Python/pystate.c#L2862) shows that this means the default executor doesn't hold the GIL at that time. In this PR, we explicitly acquire the GIL before we call `PyGILState_Release`. Signed-off-by: kaihsun <kaihsun@anyscale.com>
Use smaller `:object_manager` targets as dependencies. ## Related issue number Closes #51014 Signed-off-by: Elijah Melton <elimelt@uw.edu>
We have an experimental tensorflow utils library that was moved out of `rllib` and into core in 2019 (thread: #5698). The way it's written makes Ray depend on RLlib (circular dependency). We could refactor this code to move the other related utils into Ray, but as far as I can tell, outside of `rllib`, this code was only used in an example from the documentation that was removed in 2022: #24727 (some of the code was missed, I'm removing it here). We do not expose it in our documentation at all and it lives under `experimental`, so should be safe to remove. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Breaking dependencies on libraries from core. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…51033) Currently, the worker fires-and-forgets a disconnect message to the raylet, then goes through its shutdown procedure and closes the socket. However, in #50812 we want to add detection of unexpected disconnects on the socket. With the current procedure, the worker may send the disconnect and close its socket immediately, causing the raylet to detect the unexpected disconnect prior to processing the graceful disconnect message. This PR adds a `DisconnectClientReply` that the raylet will send once it has processed the disconnect message. The worker blocks until it receives this reply, then proceeds with the disconnect procedure and closes its socket. ## Related issue number Towards #49999 Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
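The blocking handshake can be sketched with a pair of in-process queues standing in for the worker-raylet socket. This is purely illustrative; the real protocol sends `DisconnectClientReply` over the raylet socket, not over Python queues.

```python
import queue
import threading

to_raylet = queue.Queue()  # messages worker -> raylet
to_worker = queue.Queue()  # messages raylet -> worker


def raylet() -> None:
    msg = to_raylet.get()
    if msg == "DisconnectClient":
        # Process the graceful disconnect first, then acknowledge, so the
        # worker doesn't close its socket before we've handled the message.
        to_worker.put("DisconnectClientReply")


t = threading.Thread(target=raylet)
t.start()

# Worker side: send the disconnect, then BLOCK until the reply arrives
# before proceeding to close the socket.
to_raylet.put("DisconnectClient")
reply = to_worker.get(timeout=5)
t.join()
assert reply == "DisconnectClientReply"
```

Blocking on the reply removes the race where the raylet sees the socket close before it processes the graceful disconnect.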
Pre-merge passed on the last fix just because the tests were already marked flaky, and not all of them were running on pre-merge because of flakiness. Tests already marked flaky only run on post-merge. --------- Signed-off-by: dayshah <dhyey2019@gmail.com>
Remove `python/ray/rllib/` directory during core tests to enforce componentization. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…1192) 1. Currently, the output of `ray status -v` only includes information on node types (i.e., group names in KubeRay) and Ray node IDs. However, it is not easy to map a Ray node ID to the name of the corresponding Ray Pod (i.e., the instance ID in the Autoscaler). [Screenshot of `ray status -v` output omitted.] 2. Refactoring. Signed-off-by: kaihsun <kaihsun@anyscale.com>
This is a very old regression test, refactored it to test the underlying core behavior directly. Also removed a test that has been skipped for a long time. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Remove `python/ray/tune/` directory during core tests to enforce componentization. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed? * Fix Ray Train release test arg parsing. * Bump up the limit from 500K to 1M rows. Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…e for duration="auto". (#51637)
## Why are these changes needed? Update the list of possible deployment statuses in the docs. Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
#51645) Flaky on Windows. Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Enable a CI environment with writeable cgroupv2. To use this environment for testing, you need to use the `--privileged-container` flag in your test definitions. To enable bazel to read/write `/sys/fs/cgroup` in your tests, you need to use the `--build-type cgroup` flag. Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
## Why are these changes needed? This PR removes a feature flag that controls whether the proxy should use cached replica queue length values for routing. The FF was [introduced](#42943) over a year ago as a way for users to quickly switch back to the previous implementation. It has been enabled by default for [over a year](#43169) now and works as expected, so let's remove it. Consequently, this PR also removes `RAY_SERVE_ENABLE_STRICT_MAX_ONGOING_REQUESTS`, as it is always enabled if `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE` is enabled. Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
## Why are these changes needed?

The current Databricks integration in Ray Data requires providing the Databricks host URL without the "https://" prefix. However, this creates compatibility issues when using Ray Data alongside MLflow, as MLflow's Databricks integration (which uses the same DATABRICKS_HOST environment variable) expects the URL to include the "https://" prefix.

## Related issue number

Closes #49925

Signed-off-by: Gil Leibovitz <gil.leibovitz@doubleverify.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Gil Leibovitz <gil.leibovitz@doubleverify.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
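Host normalization along these lines accepts `DATABRICKS_HOST` with or without the scheme, so Ray Data and MLflow can share the same value. The helper name is hypothetical, not the exact code in the PR:

```python
def normalize_databricks_host(host: str) -> str:
    """Accept a Databricks host with or without an https:// prefix.

    Hypothetical helper illustrating the compatibility fix: MLflow
    expects the scheme, so we add it when it's missing.
    """
    host = host.strip().rstrip("/")
    if host.startswith("https://") or host.startswith("http://"):
        return host
    return "https://" + host


assert (
    normalize_databricks_host("example.cloud.databricks.com")
    == "https://example.cloud.databricks.com"
)
assert (
    normalize_databricks_host("https://example.cloud.databricks.com/")
    == "https://example.cloud.databricks.com"
)
```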
…51433)

## Why are these changes needed?

Update repartition on `target_num_rows_per_block`: when `target_num_rows_per_block` is set, only repartition Dataset blocks that are larger than `target_num_rows_per_block`.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
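The intended behavior can be sketched on plain lists of rows, a toy model of Dataset blocks (the function below is illustrative, not Ray Data's implementation):

```python
def repartition_blocks(blocks, target_num_rows_per_block):
    """Split only the blocks larger than the target; leave others as-is."""
    out = []
    for block in blocks:
        if len(block) <= target_num_rows_per_block:
            out.append(block)  # already small enough: not touched
        else:
            # Split the oversized block into target-sized chunks.
            for i in range(0, len(block), target_num_rows_per_block):
                out.append(block[i:i + target_num_rows_per_block])
    return out


blocks = [list(range(3)), list(range(10))]
result = repartition_blocks(blocks, target_num_rows_per_block=4)
assert result[0] == [0, 1, 2]                 # small block left alone
assert [len(b) for b in result[1:]] == [4, 4, 2]  # large block split up
```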
uv is much, much faster. Also adds tests to enforce dependency integrity. Signed-off-by: kevin <kevin@anyscale.com>
so that the "lint" tag can be used for just lints and will not be abused as an "always" tag. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…51668) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: jukejian <jukejian@bytedance.com>
…or missing keys (#44769) As stated in #44768, the current implementation of `multiget` based on `np.searchsorted` does not check for missing keys. I added the required checks and updated the unit test for this case. ## Related issue number Closes #44768 Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>
## Why are these changes needed?

This PR adds async generator support to `flat_map`. The implementation is similar to how #46129 handled async callable classes for `map_batches()`. Changes include:
* Generalize the logic in `_generate_transform_fn_for_async_flat_map` so it can process both batches and rows
* Add a test case for async `flat_map`

## Related issue number

#50329

Signed-off-by: Drice1999 <chenxh267@gmail.com>
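A row-level transform of this kind can be sketched with a helper that drains an async generator per row and flattens the results. This is a hypothetical illustration, not the PR's actual `_generate_transform_fn_for_async_flat_map`:

```python
import asyncio


async def async_flat_map(rows, fn):
    """Apply an async-generator fn to each row and flatten the results."""
    out = []
    for row in rows:
        # fn(row) is an async generator; drain it item by item.
        async for item in fn(row):
            out.append(item)
    return out


async def duplicate(row):
    # An async generator transform: yields each row twice.
    yield row
    await asyncio.sleep(0)  # pretend to await I/O between yields
    yield row


result = asyncio.run(async_flat_map([1, 2], duplicate))
assert result == [1, 1, 2, 2]
```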
cgroup tests access (read and write) system directories, so they shouldn't be executed in parallel. For example, `bazel test //src/ray/common/cgroup/tests:all` should run tests one by one. Signed-off-by: dentiny <dentinyhao@gmail.com>
…nals (#51582) If a threaded actor receives exit signals twice, it will execute [task_receiver_->Stop()](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/core_worker.cc#L1224) twice. However, the second call to `task_receiver_->Stop()` will get stuck forever when executing the [releaser](https://github.com/ray-project/ray/blob/6bb9cef9257046ae31f78f6c52015a8ebf009f81/src/ray/core_worker/transport/concurrency_group_manager.cc#L135).

### Reproduction

```sh
ray start --head --include-dashboard=True --num-cpus=1
# https://gist.github.com/kevin85421/7a42ac3693537c2148fa554065bb5223
python3 test.py
# Some actors are still ALIVE. If all actors are DEAD, increase the number of actors.
ray list actors
```

Signed-off-by: kaihsun <kaihsun@anyscale.com>
This PR fixes all variable shadowing in the core worker. Variable shadowing is known to be error prone; I personally just spent 20 minutes debugging an issue caused by it. Need to address all issues before #51669. Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Linkun Chen <github@lkchen.net>
When doing release benchmark comparison, I found the script is performing comparison on **full filepaths** rather than **release result filenames**. Signed-off-by: dentiny <dentinyhao@gmail.com>
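Keying the comparison by result filename rather than full path can be sketched as follows (hypothetical helper name, illustrating the fix rather than the script's exact code):

```python
import os


def index_results_by_filename(paths):
    """Map release-result filenames to full paths, so two result
    directories can be compared file-by-file regardless of their prefix."""
    return {os.path.basename(p): p for p in paths}


base = index_results_by_filename(["/tmp/base_version/microbenchmark.json"])
new = index_results_by_filename(["/tmp/new_version/microbenchmark.json"])

# Full paths differ, but the filenames line up, so each base result can
# be paired with its counterpart in the new run:
assert base.keys() == new.keys()
```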
`resize` is pretty inefficient, default-constructing all these elements in the case of a lot of args. Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
While looking at a dashboard flame graph, `memory_full_info` was actually visible even though it only runs every 5 seconds. Our two calls mean we're getting the info twice, which is even more expensive, and it can also result in inconsistent info between USS and RSS. We also get an additional speedup by using `oneshot()` when getting CPU percent as well. https://psutil.readthedocs.io/en/latest/#psutil.Process.oneshot Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…50375) Adds an easy-to-use abstraction that allows using a flat map with reader/writer locks without thinking about thread safety or correct usage. Also see the discussion on a thread-safe GCS node manager: #50024 (comment). Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
These tests should be owned by core. --------- Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
In the current implementation, we create multiple gRPC stubs but not multiple TCP connections. I did some benchmarking, which shows that multiple gRPC stubs do improve performance, so we cannot simply remove them. Baseline: #51656

Benchmark result:
```sh
ubuntu@hjiang-devbox-pg$ python3 /home/ubuntu/ray/release/release_logs/compare_perf_metrics /home/ubuntu/ray/release/release_logs/base_version /home/ubuntu/ray/release/release_logs/new_version
REGRESSION 14.66%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.969306035542166 to 5.094453847363354 (14.66%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 14.51%: tasks_per_second (THROUGHPUT) regresses from 229.36388671973262 to 196.07926002040125 (14.51%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 12.24%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.238079759644892 to 11.61748355563387 (12.24%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 6.99%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 16841.57415006188 to 15663.994106069527 (6.99%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 6.66%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.0308320360232734 to 4.695580178438465 (6.66%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 3.27%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8271.004793738302 to 8000.250850666334 (3.27%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.93%: client__tasks_and_put_batch (THROUGHPUT) regresses from 14039.49342189659 to 13628.021852828302 (2.93%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.57%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 7549.855764807742 to 7355.844212331484 (2.57%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.51%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 23671.11885393803 to 23076.0705989556 (2.51%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.48%: single_client_tasks_sync (THROUGHPUT) regresses from 952.0269029041373 to 928.4100648169424 (2.48%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 2.27%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2786.0434704559507 to 2722.7190863593105 (2.27%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.89%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.9186869126953 to 514.0347539070235 (1.89%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.48%: placement_group_create/removal (THROUGHPUT) regresses from 755.9036396341403 to 744.686404601542 (1.48%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.38%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1055.293613301056 to 1040.6856076865438 (1.38%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.21%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.963509372266668 to 0.9518418662213537 (1.21%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 1.00%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4998.683273892644 to 4948.940736644676 (1.00%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.95%: n_n_actor_calls_async (THROUGHPUT) regresses from 27023.396892676123 to 26767.1762953109 (0.95%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.94%: single_client_put_gigabytes (THROUGHPUT) regresses from 19.0889649635755 to 18.909259900586363 (0.94%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.72%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4642.895287623349 to 4609.41957458445 (0.72%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.71%: single_client_tasks_async (THROUGHPUT) regresses from 7909.367644778592 to 7852.882247928618 (0.71%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.37%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 10536.660461092333 to 10497.924384397613 (0.37%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.31%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1045.148463125327 to 1041.9241552653498 (0.31%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.14%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5161.152587049948 to 5154.097793024906 (0.14%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 0.08%: client__put_calls (THROUGHPUT) regresses from 774.1865536687783 to 773.580427400101 (0.08%) in /home/ubuntu/ray/release/release_logs/new_version/microbenchmark.json
REGRESSION 29.16%: dashboard_p95_latency_ms (LATENCY) regresses from 26.737 to 34.534 (29.16%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 28.52%: dashboard_p99_latency_ms (LATENCY) regresses from 3058.247 to 3930.337 (28.52%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_actors.json
REGRESSION 16.62%: stage_3_creation_time (LATENCY) regresses from 1.8721938133239746 to 2.183365821838379 (16.62%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 11.56%: dashboard_p99_latency_ms (LATENCY) regresses from 277.091 to 309.116 (11.56%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_pgs.json
REGRESSION 9.37%: dashboard_p99_latency_ms (LATENCY) regresses from 77.557 to 84.822 (9.37%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 5.97%: avg_pg_remove_time_ms (LATENCY) regresses from 1.3214836546547326 to 1.4004240105108736 (5.97%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_placement_group.json
REGRESSION 5.43%: dashboard_p50_latency_ms (LATENCY) regresses from 5.399 to 5.692 (5.43%) in /home/ubuntu/ray/release/release_logs/new_version/benchmarks/many_nodes.json
REGRESSION 3.73%: avg_iteration_time (LATENCY) regresses from 1.1903677296638489 to 1.234775812625885 (3.73%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_dead_actors.json
REGRESSION 1.09%: stage_1_avg_iteration_time (LATENCY) regresses from 12.530043482780457 to 12.667147731781006 (1.09%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.48%: stage_2_avg_iteration_time (LATENCY) regresses from 38.180511474609375 to 38.36462712287903 (0.48%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.32%: stage_3_time (LATENCY) regresses from 1884.1879193782806 to 1890.2630755901337 (0.32%) in /home/ubuntu/ray/release/release_logs/new_version/stress_tests/stress_test_many_tasks.json
REGRESSION 0.05%: 10000_args_time (LATENCY) regresses from 19.075126932999993 to 19.085101717 (0.05%) in /home/ubuntu/ray/release/release_logs/new_version/scalability/single_node.json
```

Reproduction commands (Buildkite token redacted):
```sh
BUILDKITE_TOKEN=<redacted> python3 /home/ubuntu/ray/release/release_logs/fetch_release_logs.py new_version e00693c dentiny:hjiang/fix-client-conn
BUILDKITE_TOKEN=<redacted> python3 /home/ubuntu/ray/release/release_logs/fetch_release_logs.py base_version b92dcd1 dentiny:hjiang/baseline-release-benchmark
python3 /home/ubuntu/ray/release/release_logs/compare_perf_metrics /home/ubuntu/ray/release/release_logs/base_version /home/ubuntu/ray/release/release_logs/new_version
```

Signed-off-by: dentiny <dentinyhao@gmail.com>
…rd fail (#51693) keep things in sync and intact Signed-off-by: kevin <kevin@anyscale.com>