[core] Changing default tensor serialization in compiled graphs #50778
Conversation
Context: These tests clarify the difference between Ray core and compiled graphs in transferring tensors.
Test scope
Issues
Proposal
Force-pushed from 10db8c6 to f0f6e82.
@anmscale I know that CG doesn't support multiple devices per actor atm, but we should also consider that behavior because:
Could you add this to the proposal? (including current core behavior -- I know we discussed it somewhere but now I don't remember...)
Force-pushed from b10a6ff to 5bfa90e.
Updated the proposal and added a test. Note that I'm not sure what the context is for not supporting multiple GPUs in compiled graph actors. It might not be a high-priority issue though, because we usually assign a single GPU per actor.
Force-pushed from 08b031e to bb919bb.
cc @edoakes @kevin85421 for reviews
I spoke with @stephanie-wang and she prefers that I merge this PR with the one I planned next for the code change. Will update here shortly.
Just double-checking you are not blocked atm.
Just pushed the new behavior in a new commit. Still fixing some unit tests.
Force-pushed from 77ee2b2 to 604c188.
Is the device_policy inherited across multiple calls in the DAG, or does it only apply to one edge? If it only applies to one edge, then defining an alternative policy does not provide any utility -- it would be equally concise and expressive to require the user to explicitly specify the target device instead (target_device="cuda" instead of device_policy="default").
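To make the per-edge reading concrete, here is a hedged sketch assuming the with_tensor_transport API from this PR; the GPUActor class and method names are hypothetical:

```python
import ray
import torch
from ray.dag import InputNode

@ray.remote(num_gpus=1)
class GPUActor:
    def double(self, t: torch.Tensor) -> torch.Tensor:
        return t * 2

sender = GPUActor.remote()
receiver = GPUActor.remote()

with InputNode() as inp:
    mid = sender.double.bind(inp)
    # The hint below annotates only the sender -> receiver edge; other
    # edges in the DAG keep their own (default) transport.
    dag = receiver.double.bind(mid.with_tensor_transport(transport="nccl"))

compiled_dag = dag.experimental_compile()
```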
python/ray/dag/dag_node.py (Outdated)
@@ -141,17 +142,20 @@ def _collect_upstream_nodes(self) -> List["DAGNode"]:
    def with_tensor_transport(
        self,
        transport: Optional[Union[str, Communicator]] = "auto",
        device_policy: Optional[Literal["auto", "default"]] = "auto",
it's exceptionally confusing that "default" is not the default 🙃
perhaps it would be clearer to call it "default_device" instead?
Agreed, this is a better option, though a little too long. Will change to that.
Let's use device instead of device_policy. It is shorter and conveys the same meaning. Here is another suggestion for the options:
- "retain": Tensors are always moved to a device on the receiver that matches the original device on the sender, if such a device is visible to the receiver. I would make this the default option.
- "auto": Tensors are always moved to the receiver's default device.
Sounds good to me. Maybe we should also explain in the docstring what the default device is, i.e. the torch_device defined in ChannelContext.
@@ -22,6 +22,7 @@ class TorchTensorType(ChannelOutputType):
    def __init__(
        self,
        transport: Optional[Union[str, Communicator]] = AUTO,
        device_policy: Literal["auto", "default"] = "auto",
This should be defined as a string Enum (which can then easily be publicly documented and extended)
Makes sense, will change that too.
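For illustration, a minimal sketch of such a string Enum in plain Python; the class and member names here are assumptions drawn from the options discussed in this thread, not the names the PR ended up using:

```python
from enum import Enum

class Device(str, Enum):
    """Target device policy for tensor deserialization."""
    AUTO = "auto"
    RETAIN = "retain"
    CPU = "cpu"
    GPU = "gpu"

# Subclassing str lets the user-facing API keep accepting plain strings
# while internal code compares against Enum members:
assert Device("retain") is Device.RETAIN
assert Device.AUTO == "auto"
```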
Doesn't this apply to transport as well? I think we can change the internal API to use DevicePolicy but keep the user-facing one (e.g. with_tensor_transport) using a string value. Makes sense?
Here's my thinking:
I still think using the following options would lead to better default behavior:
- "auto": (default) If a tensor matches its sender's default device, then the receiver will deserialize the tensor to its default device. All other tensors will be deserialized to an equivalent device on the receiver, and an error is thrown if the device doesn't exist.
- "retain": Tensors will be deserialized to an equivalent device on the receiver, and an error is thrown if the device doesn't exist.
- "cpu"
- "cuda" and/or "gpu"
The main reason is for driver <> actor tensors. I think it will be common for the driver to want to read/write some data from the actors, and it may not have a GPU available. In that case, the current default in this PR will throw an error.
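To pin down the proposed semantics, here is a hedged sketch of the resolution logic described above; the function and parameter names are illustrative, not Ray internals:

```python
import torch

def resolve_target_device(
    tensor_device: torch.device,
    sender_default: torch.device,
    receiver_default: torch.device,
    policy: str,
) -> torch.device:
    if policy == "auto":
        # The tensor lived on the sender's default device: follow the
        # receiver's default (e.g. CPU on a GPU-less driver).
        if tensor_device == sender_default:
            return receiver_default
        # Otherwise fall back to "retain" semantics.
        policy = "retain"
    if policy == "retain":
        # Keep an equivalent device on the receiver; error if unavailable.
        if tensor_device.type == "cuda" and not torch.cuda.is_available():
            raise RuntimeError(f"{tensor_device} is not visible on the receiver")
        return tensor_device
    # Explicit targets such as "cpu" or "cuda".
    return torch.device(policy)
```

Under this sketch, a tensor produced on an actor's default GPU deserializes to CPU on a GPU-less driver with "auto", but raises with "retain".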
Force-pushed from 624e001 to 6996452.
Force-pushed from 4b1a82b to 745f08b.
Force-pushed from aadebad to 49e8d7e.
Maybe we also need to test (see the sketch below):
- input / output type hints of the same DAG node are different
- different input type hints for the same DAG node
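A hedged sketch of the first suggested case, assuming the with_tensor_transport API from this PR; the actor and method names are hypothetical:

```python
import ray
import torch
from ray.dag import InputNode

@ray.remote(num_gpus=1)
class Worker:
    def echo(self, t: torch.Tensor) -> torch.Tensor:
        return t

a, b = Worker.remote(), Worker.remote()

with InputNode() as inp:
    # The input edge of a.echo carries the default ("auto") hint while
    # its output edge uses NCCL -- so the same DAG node has different
    # input / output type hints.
    x = a.echo.bind(inp.with_tensor_transport())
    dag = b.echo.bind(x.with_tensor_transport(transport="nccl"))

compiled_dag = dag.experimental_compile()
```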
Force-pushed from c5f17bd to 3ffad24.
Force-pushed from 76f7bba to 0153d8a.
Fix unit test issues
Why are these changes needed?
Changing the default tensor serialization in compiled graphs. Also added a comprehensive set of unit tests covering cases for torch.Tensor serialization in both Ray core and compiled graphs.
Related issue number
Related to issues:
- #50134
- #50452
Also related to [core][aDAG] Fix cpu tensor is automatically converted to gpu tensor #47742
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.