
[core] Changing default tensor serialization in compiled graphs #50778

Merged
merged 29 commits into ray-project:master on Mar 9, 2025

Conversation


@anmscale anmscale commented Feb 21, 2025

Why are these changes needed?

Changing the default tensor serialization in compiled graphs. Also added a comprehensive set of unit tests covering cases for torch.Tensor serialization in both Ray core and compiled graphs.

Related issue number

Related to issues: #50134, #50452. Also related to #47742.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@anmscale anmscale added the core Issues that should be addressed in Ray Core label Feb 21, 2025
@anmscale anmscale (Contributor, Author) commented Feb 21, 2025

Context

These tests clarify the difference between Ray core and compiled graphs in transferring a torch.Tensor between two Ray actors. Let's assume we want to transfer tensor t between a source and a destination. In short:

  • Ray core tries to keep the tensor on the same device it was originally on. If that device is not available at the destination, it throws an error.
  • Compiled graphs implicitly convert the tensor to the default device of the destination actor.
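The two behaviors above can be summarized with a minimal sketch (plain Python, my own modeling, not Ray code) of the device each side resolves a tensor to on deserialization:

```python
# Sketch of the two deserialization behaviors described above, modeled as
# device-resolution functions. Function names are mine, not Ray API.

def ray_core_target(tensor_device: str, receiver_devices: set) -> str:
    """Ray core: keep the tensor on its original device, or fail."""
    if tensor_device not in receiver_devices:
        raise RuntimeError(f"device {tensor_device!r} not available on receiver")
    return tensor_device

def compiled_graph_target(tensor_device: str, receiver_default: str) -> str:
    """Compiled graphs (behavior before this PR): always move the tensor to
    the receiver's default device, regardless of its original device."""
    return receiver_default

# A cuda:0 tensor sent to a cpu-only receiver: core raises, compiled graphs
# silently place it on the receiver's default device.
print(compiled_graph_target("cuda:0", "cpu"))  # cpu
```

This is the divergence the tests below exercise: the same transfer either errors or silently changes device depending on which code path is used.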

Test scope

  • The source/destination can be one of:
    • driver (default device is cpu)
    • cpu-only actor (default device is cpu)
    • gpu actor (default device is cuda:0).
  • t.device can be either cpu or cuda:0.
  • I also test a dictionary of mixed cpu and gpu tensors.
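The scope above is a small cross-product; for illustration only, it can be enumerated with itertools.product (endpoint names and their default devices are taken from the bullets, the variable names are mine):

```python
# Illustrative enumeration of the test matrix described above.
import itertools

# endpoint -> its default device, per the bullets above
endpoints = {"driver": "cpu", "cpu_actor": "cpu", "gpu_actor": "cuda:0"}
tensor_devices = ["cpu", "cuda:0"]

# (source, destination, tensor device) combinations
cases = list(itertools.product(endpoints, endpoints, tensor_devices))
print(len(cases))  # 3 sources x 3 destinations x 2 tensor devices = 18 cases
```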

Issues

  1. Ray core and compiled graphs are completely unaligned, which is problematic for users. E.g., forgetting the tensor type hint will significantly change the behavior.
  2. Compiled graphs implicitly convert a cpu tensor to a gpu one (if the destination's default device is gpu) and vice versa.
  3. Transferring a data structure with mixed cpu and gpu tensors in compiled graphs will result in converting all tensors to the same device.
  4. Changing the Ray core behavior can introduce breaking changes and cause many downstream errors.
  5. The "default device" of an actor is not used in Ray core serialization. Also, the current API doesn't allow setting a different default device for an actor.
  6. Compiled graphs don't allow actors with multiple GPUs, so we limit the test cases to single-GPU actors in compiled graphs.

Proposal

  1. Keep behavior of Ray core unchanged.
  2. Remove the implicit device conversion that compiled graphs perform by default, thereby aligning the behavior between core and compiled graphs and supporting mixed-device transfers.
  3. Add explicit support for users to specify the destination's device, essentially following the API proposed here: [core][aDAG] Fix cpu tensor is automatically converted to gpu tensor #47742
  4. (potential) Enable multi-GPU support in a single actor in compiled graphs and handle serialization to a specific device.
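Proposal items 2 and 3 can be sketched with a hypothetical helper (not Ray API; the name destination_device is mine): by default the tensor's device is retained, an explicit per-edge target overrides it, and unavailable devices raise instead of being silently converted.

```python
# Hypothetical helper sketching proposal items 2 and 3: retain the tensor's
# device unless the user explicitly pins a target; never convert implicitly.

def destination_device(tensor_device, receiver_devices, target_device=None):
    # Explicit target wins; otherwise retain the original device.
    wanted = target_device if target_device is not None else tensor_device
    if wanted not in receiver_devices:
        raise RuntimeError(f"device {wanted!r} not available at destination")
    return wanted

# Mixed-device structures now work: each tensor resolves independently
# instead of all being forced onto one device (issue 3 above).
mixed = {"a": "cpu", "b": "cuda:0"}
resolved = {k: destination_device(d, {"cpu", "cuda:0"}) for k, d in mixed.items()}
print(resolved)  # {'a': 'cpu', 'b': 'cuda:0'}
```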

@anmscale anmscale force-pushed the tensor-transport-tests branch 4 times, most recently from 10db8c6 to f0f6e82 Compare February 21, 2025 20:06
@edoakes edoakes (Collaborator) commented Feb 21, 2025

  • Ray core tries to keep the tensor on the same device it was originally on. If at destination the device is not available, it throws an error

@anmscale I know that CG doesn't support multiple devices per actor atm, but we should also consider that behavior because:

  • Ray Core does (and we should try to make all of the behavior consistent holistically).
  • We will likely want to lift the restriction in CG in the future.

Could you add this to the proposal? (including current core behavior -- I know we discussed it somewhere but now I don't remember...)

@anmscale anmscale force-pushed the tensor-transport-tests branch 2 times, most recently from b10a6ff to 5bfa90e Compare February 24, 2025 18:03
@anmscale

Could you add this to the proposal? (including current core behavior -- I know we discussed it somewhere but now I don't remember...)

Updated the proposal and added a test. Note that I'm not sure what the context is for not supporting multiple GPUs in compiled-graph actors. It might not be a high-priority issue, though, because we usually assign a single GPU per actor.

@anmscale anmscale force-pushed the tensor-transport-tests branch 3 times, most recently from 08b031e to bb919bb Compare February 24, 2025 21:29
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Feb 25, 2025
@jjyao jjyao (Collaborator) commented Feb 25, 2025

cc @edoakes @kevin85421 for reviews

@anmscale

cc @edoakes @kevin85421 for reviews

I spoke with @stephanie-wang and she prefers that I merge this PR with the one I planned next for the code change. Will update here shortly

@edoakes edoakes (Collaborator) commented Feb 26, 2025

cc @edoakes @kevin85421 for reviews

I spoke with @stephanie-wang and she prefers that I merge this PR with the one I planned next for the code change. Will update here shortly

Just double checking you are not blocked atm

@anmscale anmscale (Contributor, Author) commented Feb 26, 2025

Just double checking you are not blocked atm

I am actually blocked now, but I will meet with Rui shortly to discuss the issue I'm facing.

Just pushed the new behavior in a new commit. Still fixing some unit tests.

@anmscale anmscale force-pushed the tensor-transport-tests branch from 77ee2b2 to 604c188 Compare February 27, 2025 06:16
@anmscale anmscale requested a review from a team as a code owner February 27, 2025 06:16
@edoakes edoakes (Collaborator) left a comment

Is the device_policy inherited across multiple calls in the DAG or does it only apply to one edge? If it only applies to one edge, then defining an alternative policy does not provide any utility -- it would be equally concise and expressive to require the user to explicitly specify the target device instead (target_device="cuda" instead of device_policy="default").

@@ -141,17 +142,20 @@ def _collect_upstream_nodes(self) -> List["DAGNode"]:
    def with_tensor_transport(
        self,
        transport: Optional[Union[str, Communicator]] = "auto",
        device_policy: Optional[Literal["auto", "default"]] = "auto",
Collaborator
it's exceptionally confusing that "default" is not the default 🙃

Collaborator

perhaps it would be more clear to call it "default_device" instead?

Contributor Author

Agreed, this is a better option, though a little too long. Will change to that.

Contributor

Let's use device instead of device_policy. It is shorter and conveys the same meaning.

Here is another suggestion for the options:

  • "retain": Tensors are always moved to a device on the receiver that matches the original device on the sender, if such a device is visible to the receiver. I would make this the default option.
  • "auto": Tensors are always moved to the receiver's default device.

Contributor Author

Sounds good to me. Maybe we should explain in the docstring what the default device is, i.e. the torch_device defined in ChannelContext.

@@ -22,6 +22,7 @@ class TorchTensorType(ChannelOutputType):
    def __init__(
        self,
        transport: Optional[Union[str, Communicator]] = AUTO,
        device_policy: Literal["auto", "default"] = "auto",
@edoakes edoakes (Collaborator) commented Feb 27, 2025

This should be defined as a string Enum (which can then easily be publicly documented and extended)
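One way to realize this suggestion is a string-valued Enum, so internal code gets typed, documentable members while the public API keeps accepting plain strings. A minimal sketch (member names follow the options discussed in this thread; the class name DevicePolicy is from the review, not shipped code):

```python
# A str-valued Enum: members are real strings, so they can be parsed from
# and compared against the raw strings users pass to the public API.
from enum import Enum

class DevicePolicy(str, Enum):
    AUTO = "auto"
    RETAIN = "retain"

# Parse a user-supplied string into a typed member:
assert DevicePolicy("retain") is DevicePolicy.RETAIN
# Members compare equal to plain strings, so call sites need not change:
assert DevicePolicy.AUTO == "auto"
```

Inheriting from str is what lets the user-facing with_tensor_transport keep its string signature while internals use the Enum, which also addresses the question below about transport.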

Contributor Author

Makes sense, will change that too.

Contributor Author

Doesn't this apply to transport as well? I think we can change the internal API to use DevicePolicy but keep the user-facing one (e.g. with_tensor_transport) using a string value. Makes sense?

@anmscale

Is the device_policy inherited across multiple calls in the DAG or does it only apply to one edge? If it only applies to one edge, then defining an alternative policy does not provide any utility -- it would be equally concise and expressive to require the user to explicitly specify the target device instead (target_device="cuda" instead of device_policy="default").

Here's my thinking:

  1. I chose default/auto vs. cuda/cpu because the same edge can transfer a collection of tensors, each on a different device. The latter option is not expressive enough, so I chose the former.
  2. I think the device_policy should be flexible and scoped to a single edge.

@stephanie-wang stephanie-wang (Contributor) left a comment

I still think using the following options would lead to better default behavior:

  • "auto": (default) If a tensor matches its sender's default device, then the receiver will deserialize the tensor to its default device. All other tensors will be deserialized to an equivalent device on the receiver, and an error is thrown if the device doesn't exist.
  • "retain": Tensors will be deserialized to an equivalent device on the receiver, and an error is thrown if the device doesn't exist.
  • "cpu"
  • "cuda" and/or "gpu"

The main reason is for driver <> actor tensors. I think it will be common for the driver to want to read/write some data from the actors, and it may not have a GPU available. In that case, the current default in this PR will throw an error.
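The four options can be sketched in plain Python (my reading of the proposal, not Ray code; the resolve helper and its parameters are hypothetical):

```python
# Resolving the device a tensor lands on at the receiver, per the four
# proposed options. All names here are illustrative, not Ray API.

def resolve(policy, tensor_device, sender_default, receiver_default,
            receiver_devices):
    if policy in ("cpu", "cuda"):
        return policy  # user pinned the target device family explicitly
    if policy == "auto" and tensor_device == sender_default:
        return receiver_default  # default-device tensors follow the receiver
    # "retain", or "auto" with a non-default tensor: keep the original device
    if tensor_device not in receiver_devices:
        raise RuntimeError(f"device {tensor_device!r} not visible on receiver")
    return tensor_device

# The driver <> actor case above: a cpu-only driver reading a cuda:0 tensor
# from a gpu actor works under "auto" but errors under "retain".
print(resolve("auto", "cuda:0", "cuda:0", "cpu", {"cpu"}))  # cpu
```

The key difference from "retain" is the second branch: under "auto", a tensor sitting on the sender's default device is treated as placement-agnostic and follows the receiver's default instead of failing.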


@anmscale anmscale force-pushed the tensor-transport-tests branch from 624e001 to 6996452 Compare February 27, 2025 23:57
@anmscale anmscale changed the title [core] Unit tests for tensor serialization [core] Changing default tensor serialization in compiled graphs Feb 27, 2025
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from 4b1a82b to 745f08b Compare March 4, 2025 23:03
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from aadebad to 49e8d7e Compare March 4, 2025 23:35
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@kevin85421 kevin85421 (Member) left a comment

maybe we also need to test:

  1. input / output type hints of the same DAG node are different
  2. different input type hints for the same DAG node

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from c5f17bd to 3ffad24 Compare March 7, 2025 01:08
anmscale added 2 commits March 7, 2025 15:23
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from 76f7bba to 0153d8a Compare March 7, 2025 16:02
anmscale and others added 5 commits March 8, 2025 23:30
Fix unit test issues

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@edoakes edoakes merged commit 6baecd0 into ray-project:master Mar 9, 2025
5 checks passed
hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 12, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…project#50778)

Changing the default tensor serialization in compiled graphs. Also added
a comprehensive set of unit tests covering cases for torch.Tensor
serialization in both Ray core and compiled graphs.

## Related issue number

Related to issues:
  - ray-project#50134
  - ray-project#50452
Also related to ray-project#47742

---------

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 18, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
Drice1999 pushed a commit to Drice1999/ray that referenced this pull request Mar 23, 2025
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
Labels
core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
6 participants