
Multi GPU training on g5.12xlarge #4412

Answered by ruhanprasad
H3zi asked this question in Help
Hello, I would suggest trying the torch_distributed distribution in the estimator. This uses torchrun under the hood and should work on any multi-GPU instance type. Also make sure to set the distributed backend to "nccl" in your training script. This gives you DDP training without the instance type constraints of SMDDP. For example:

distribution = {
    "torch_distributed": {
        "enabled": True
    }
}
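As a rough sketch of how the pieces fit together (the entry point name, role, S3 paths, and framework/Python versions below are placeholders, not from the original post), the estimator side and the training-script side might look like this:

# Launcher side: SageMaker PyTorch estimator with torch_distributed enabled.
# entry_point, role, framework_version, and the S3 URI are assumptions for illustration.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",               # hypothetical training script
    role="<your-sagemaker-execution-role>",
    instance_type="ml.g5.12xlarge",       # 4x A10G GPUs
    instance_count=1,
    framework_version="2.0.0",            # assumed; use a version that supports torch_distributed
    py_version="py310",
    distribution={"torch_distributed": {"enabled": True}},
)
estimator.fit("s3://<your-bucket>/<training-data-prefix>")

# Training-script side (inside train.py): torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE,
# so the script only initializes the NCCL process group and wraps the model in DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).to(local_rank)   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])

With this setup, torchrun launches one process per GPU on the instance, so the training script itself does not need to spawn processes or hard-code the number of GPUs.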
