
[core] Changing default tensor serialization in compiled graphs #50778

Merged
merged 29 commits into ray-project:master on Mar 9, 2025

Conversation


@anmscale anmscale commented Feb 21, 2025

Why are these changes needed?

Changing the default tensor serialization in compiled graphs. Also added a comprehensive set of unit tests covering cases for torch.Tensor serialization in both Ray core and compiled graphs.

Related issue number

Related to issues: #50134, #50452. Also related to #47742.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@anmscale anmscale added the core Issues that should be addressed in Ray Core label Feb 21, 2025
@anmscale anmscale (Contributor, Author) commented Feb 21, 2025

Context

These tests clarify the difference between Ray core and compiled graphs in transferring a torch.Tensor between two Ray actors. Let's assume we want to transfer tensor t between a source and a destination. In short:

  • Ray core tries to keep the tensor on the same device it was originally on. If that device is not available at the destination, it throws an error.
  • Compiled graphs implicitly convert the tensor to the default device of the destination actor.
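The two behaviors above can be summarized with a minimal sketch (plain Python, my own modeling, not Ray code) of the device each side resolves a tensor to on deserialization:

```python
# Sketch of the two deserialization behaviors described above, modeled as
# device-resolution functions. Function names are mine, not Ray API.

def ray_core_target(tensor_device: str, receiver_devices: set) -> str:
    """Ray core: keep the tensor on its original device, or fail."""
    if tensor_device not in receiver_devices:
        raise RuntimeError(f"device {tensor_device!r} not available on receiver")
    return tensor_device

def compiled_graph_target(tensor_device: str, receiver_default: str) -> str:
    """Compiled graphs (behavior before this PR): always move the tensor to
    the receiver's default device, regardless of its original device."""
    return receiver_default

# A cuda:0 tensor sent to a cpu-only receiver: core raises, compiled graphs
# silently place it on the receiver's default device.
print(compiled_graph_target("cuda:0", "cpu"))  # cpu
```

This is the divergence the tests below exercise: the same transfer either errors or silently changes device depending on which code path is used.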

Test scope

  • The source/destination can be one of:
    • driver (default device is cpu)
    • cpu-only actor (default device is cpu)
    • gpu actor (default device is cuda:0).
  • t.device can be either cpu or cuda:0.
  • I also test a dictionary of mixed cpu and gpu tensors.
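The scope above is a small cross-product; for illustration only, it can be enumerated with itertools.product (endpoint names and their default devices are taken from the bullets, the variable names are mine):

```python
# Illustrative enumeration of the test matrix described above.
import itertools

# endpoint -> its default device, per the bullets above
endpoints = {"driver": "cpu", "cpu_actor": "cpu", "gpu_actor": "cuda:0"}
tensor_devices = ["cpu", "cuda:0"]

# (source, destination, tensor device) combinations
cases = list(itertools.product(endpoints, endpoints, tensor_devices))
print(len(cases))  # 3 sources x 3 destinations x 2 tensor devices = 18 cases
```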

Issues

  1. Ray core and compiled graphs are completely unaligned, which is problematic for users. E.g., forgetting the tensor type hint will significantly change the behavior.
  2. Compiled graphs implicitly convert a cpu tensor to a gpu one (if the destination's default device is gpu) and vice versa.
  3. Transferring a data structure with mixed cpu and gpu tensors in compiled graphs will result in converting all tensors to the same device.
  4. Changing the Ray core behavior can introduce breaking changes and cause many downstream errors.
  5. The "default device" of an actor is not used in Ray core serialization. Also, the current API doesn't allow setting a different default device for an actor.
  6. Compiled graphs don't allow actors with multiple GPUs, so we limit the test cases to single-GPU actors in compiled graphs.

Proposal

  1. Keep behavior of Ray core unchanged.
  2. Remove the implicit device conversion that compiled graphs perform by default, thereby aligning the behavior between core and compiled graphs and supporting mixed-device transfers.
  3. Add explicit support for users to specify the destination's device, essentially following the API proposed here: [core][aDAG] Fix cpu tensor is automatically converted to gpu tensor #47742
  4. (potential) Enable multi-GPU support in a single actor in compiled graphs and handle serialization to a specific device.
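Proposal items 2 and 3 can be sketched with a hypothetical helper (not Ray API; the name destination_device is mine): by default the tensor's device is retained, an explicit per-edge target overrides it, and unavailable devices raise instead of being silently converted.

```python
# Hypothetical helper sketching proposal items 2 and 3: retain the tensor's
# device unless the user explicitly pins a target; never convert implicitly.

def destination_device(tensor_device, receiver_devices, target_device=None):
    # Explicit target wins; otherwise retain the original device.
    wanted = target_device if target_device is not None else tensor_device
    if wanted not in receiver_devices:
        raise RuntimeError(f"device {wanted!r} not available at destination")
    return wanted

# Mixed-device structures now work: each tensor resolves independently
# instead of all being forced onto one device (issue 3 above).
mixed = {"a": "cpu", "b": "cuda:0"}
resolved = {k: destination_device(d, {"cpu", "cuda:0"}) for k, d in mixed.items()}
print(resolved)  # {'a': 'cpu', 'b': 'cuda:0'}
```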

@anmscale anmscale force-pushed the tensor-transport-tests branch 4 times, most recently from 10db8c6 to f0f6e82 Compare February 21, 2025 20:06
@edoakes edoakes (Collaborator) commented Feb 21, 2025

  • Ray core tries to keep the tensor on the same device it was originally on. If at destination the device is not available, it throws an error

@anmscale I know that CG doesn't support multiple devices per actor atm, but we should also consider that behavior because:

  • Ray Core does (and we should try to make all of the behavior consistent holistically).
  • We will likely want to lift the restriction in CG in the future.

Could you add this to the proposal? (including current core behavior -- I know we discussed it somewhere but now I don't remember...)

@anmscale anmscale force-pushed the tensor-transport-tests branch 2 times, most recently from b10a6ff to 5bfa90e Compare February 24, 2025 18:03
@anmscale

Could you add this to the proposal? (including current core behavior -- I know we discussed it somewhere but now I don't remember...)

Updated the proposal and added a test. Note that I'm not sure what the context is for not supporting multiple GPUs in compiled-graph actors. It might not be a high-priority issue, though, because we usually assign a single GPU per actor.

@anmscale anmscale force-pushed the tensor-transport-tests branch 3 times, most recently from 08b031e to bb919bb Compare February 24, 2025 21:29
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Feb 25, 2025
@jjyao jjyao (Collaborator) commented Feb 25, 2025

cc @edoakes @kevin85421 for reviews

@anmscale

cc @edoakes @kevin85421 for reviews

I spoke with @stephanie-wang and she prefers that I merge this PR with the one I planned next for the code change. Will update here shortly

@edoakes edoakes (Collaborator) commented Feb 26, 2025

cc @edoakes @kevin85421 for reviews

I spoke with @stephanie-wang and she prefers that I merge this PR with the one I planned next for the code change. Will update here shortly

Just double checking you are not blocked atm

@anmscale anmscale (Contributor, Author) commented Feb 26, 2025

Just double checking you are not blocked atm

I am actually blocked now, but I will meet with Rui shortly to discuss the issue I'm facing.

Just pushed the new behavior in a new commit. Still fixing some unit tests.

@anmscale anmscale force-pushed the tensor-transport-tests branch from 77ee2b2 to 604c188 Compare February 27, 2025 06:16
@anmscale anmscale requested a review from a team as a code owner February 27, 2025 06:16
@edoakes edoakes (Collaborator) left a comment

Is the device_policy inherited across multiple calls in the DAG or does it only apply to one edge? If it only applies to one edge, then defining an alternative policy does not provide any utility -- it would be equally concise and expressive to require the user to explicitly specify the target device instead (target_device="cuda" instead of device_policy="default").

@@ -141,17 +142,20 @@ def _collect_upstream_nodes(self) -> List["DAGNode"]:
    def with_tensor_transport(
        self,
        transport: Optional[Union[str, Communicator]] = "auto",
        device_policy: Optional[Literal["auto", "default"]] = "auto",
Collaborator
it's exceptionally confusing that "default" is not the default 🙃

Collaborator

perhaps it would be more clear to call it "default_device" instead?

Contributor Author

Agreed, this is a better option, though a little too long. Will change to that.

Contributor

Let's use device instead of device_policy. It is shorter and conveys the same meaning.

Here is another suggestion for the options:

  • "retain": Tensors are always moved to a device on the receiver that matches the original device on the sender, if such a device is visible to the receiver. I would make this the default option.
  • "auto": Tensors are always moved to the receiver's default device.

Contributor Author

Sounds good to me. Maybe we should explain in the docstring what the default device is, i.e. the torch_device defined in ChannelContext.

@@ -22,6 +22,7 @@ class TorchTensorType(ChannelOutputType):
    def __init__(
        self,
        transport: Optional[Union[str, Communicator]] = AUTO,
        device_policy: Literal["auto", "default"] = "auto",
@edoakes edoakes (Collaborator) commented Feb 27, 2025

This should be defined as a string Enum (which can then easily be publicly documented and extended)
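One way to realize this suggestion is a string-valued Enum, so internal code gets typed, documentable members while the public API keeps accepting plain strings. A minimal sketch (member names follow the options discussed in this thread; the class name DevicePolicy is from the review, not shipped code):

```python
# A str-valued Enum: members are real strings, so they can be parsed from
# and compared against the raw strings users pass to the public API.
from enum import Enum

class DevicePolicy(str, Enum):
    AUTO = "auto"
    RETAIN = "retain"

# Parse a user-supplied string into a typed member:
assert DevicePolicy("retain") is DevicePolicy.RETAIN
# Members compare equal to plain strings, so call sites need not change:
assert DevicePolicy.AUTO == "auto"
```

Inheriting from str is what lets the user-facing with_tensor_transport keep its string signature while internals use the Enum, which also addresses the question below about transport.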

Contributor Author

Makes sense, will change that too.

Contributor Author

Doesn't this apply to transport as well? I think we can change the internal API to use DevicePolicy but keep the user-facing one (e.g. with_tensor_transport) using a string value. Makes sense?

@anmscale

Is the device_policy inherited across multiple calls in the DAG or does it only apply to one edge? If it only applies to one edge, then defining an alternative policy does not provide any utility -- it would be equally concise and expressive to require the user to explicitly specify the target device instead (target_device="cuda" instead of device_policy="default").

Here's my thinking:

  1. I chose default/auto vs. cuda/cpu because the same edge can transfer a collection of tensors, each on a different device. The latter option is not expressive enough, so I chose the former.
  2. I think the device_policy should be flexible and scoped to a single edge.

@stephanie-wang stephanie-wang (Contributor) left a comment

I still think using the following options would lead to better default behavior:

  • "auto": (default) If a tensor matches its sender's default device, then the receiver will deserialize the tensor to its default device. All other tensors will be deserialized to an equivalent device on the receiver, and an error is thrown if the device doesn't exist.
  • "retain": Tensors will be deserialized to an equivalent device on the receiver, and an error is thrown if the device doesn't exist.
  • "cpu"
  • "cuda" and/or "gpu"

The main reason is for driver <> actor tensors. I think it will be common for the driver to want to read/write some data from the actors, and it may not have a GPU available. In that case, the current default in this PR will throw an error.
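The four options can be sketched in plain Python (my reading of the proposal, not Ray code; the resolve helper and its parameters are hypothetical):

```python
# Resolving the device a tensor lands on at the receiver, per the four
# proposed options. All names here are illustrative, not Ray API.

def resolve(policy, tensor_device, sender_default, receiver_default,
            receiver_devices):
    if policy in ("cpu", "cuda"):
        return policy  # user pinned the target device family explicitly
    if policy == "auto" and tensor_device == sender_default:
        return receiver_default  # default-device tensors follow the receiver
    # "retain", or "auto" with a non-default tensor: keep the original device
    if tensor_device not in receiver_devices:
        raise RuntimeError(f"device {tensor_device!r} not visible on receiver")
    return tensor_device

# The driver <> actor case above: a cpu-only driver reading a cuda:0 tensor
# from a gpu actor works under "auto" but errors under "retain".
print(resolve("auto", "cuda:0", "cuda:0", "cpu", {"cpu"}))  # cpu
```

The key difference from "retain" is the second branch: under "auto", a tensor sitting on the sender's default device is treated as placement-agnostic and follows the receiver's default instead of failing.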


@anmscale anmscale force-pushed the tensor-transport-tests branch from 624e001 to 6996452 Compare February 27, 2025 23:57
@anmscale anmscale changed the title [core] Unit tests for tensor serialization [core] Changing default tensor serialization in compiled graphs Feb 27, 2025
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from 4b1a82b to 745f08b Compare March 4, 2025 23:03
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from aadebad to 49e8d7e Compare March 4, 2025 23:35
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@kevin85421 kevin85421 (Member) left a comment

maybe we also need to test:

  1. input / output type hints of the same DAG node are different
  2. different input type hints for the same DAG node

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from c5f17bd to 3ffad24 Compare March 7, 2025 01:08
anmscale added 2 commits March 7, 2025 15:23
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@anmscale anmscale force-pushed the tensor-transport-tests branch from 76f7bba to 0153d8a Compare March 7, 2025 16:02
anmscale and others added 5 commits March 8, 2025 23:30
Fix unit test issues

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Signed-off-by: Amjad Almahairi <anm@anyscale.com>
@edoakes edoakes merged commit 6baecd0 into ray-project:master Mar 9, 2025
5 checks passed
hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 12, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…project#50778)

Changing the default tensor serialization in compiled graphs. Also added
a comprehensive set of unit tests covering cases for torch.Tensor
serialization in both Ray core and compiled graphs.

## Related issue number

Related to issues:
  - ray-project#50134
  - ray-project#50452
Also related to ray-project#47742

---------

Signed-off-by: Amjad Almahairi <anm@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 18, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
Drice1999 pushed a commit to Drice1999/ray that referenced this pull request Mar 23, 2025
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
Labels
core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
6 participants