Multi GPU training on g5.12xlarge #4412
-
Hi all, I can hack my way through this using PyTorch/Accelerate, but I wonder why there's no straightforward way to do it with the SageMaker SDK? Thanks
Replies: 1 comment
-
Hello, I would suggest trying the `torch_distributed` distribution in the estimator. It uses torchrun under the hood and should work on any multi-GPU instance type. Also make sure to set the distributed backend to "nccl" in your training script. This gives you DDP training without the instance type constraints of SMDDP.
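
For reference, a rough sketch of what that could look like. The script name, role, and framework/Python versions below are placeholders, not values from this thread, so adjust them to your setup:

```python
# Estimator side: enable torch_distributed so SageMaker launches the
# training script with torchrun (one worker process per GPU).
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                  # hypothetical script name
    role="<your-sagemaker-execution-role>",  # placeholder
    instance_type="ml.g5.12xlarge",          # 4 x A10G GPUs
    instance_count=1,
    framework_version="2.0.1",               # pick versions matching your script
    py_version="py310",
    distribution={"torch_distributed": {"enabled": True}},
)

estimator.fit()
```

And inside the training script, initialize the process group with the NCCL backend and wrap the model in DDP. torchrun sets `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` for you:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# NCCL is the backend to use for GPU-to-GPU communication.
dist.init_process_group(backend="nccl")

# Pin each worker process to its own GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)  # MyModel is a placeholder for your model
model = DDP(model, device_ids=[local_rank])
```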