
Multi GPU training on g5.12xlarge #4412

Answered by ruhanprasad
H3zi asked this question in Help
Hello, I would suggest trying the torch_distributed distribution in the estimator. This uses torchrun under the hood and should work on any multi-GPU instance type. Also make sure to set the distributed backend to "nccl" in your training script. This gives you DDP training without the instance type constraints of SMDDP. For example:

distribution = {
    "torch_distributed": {
        "enabled": True
    }
}
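As a rough sketch of how the pieces fit together (the entry point name, role, S3 paths, and framework/Python versions below are placeholders, not from the original post), the estimator side and the training-script side might look like this:

# Launcher side: SageMaker PyTorch estimator with torch_distributed enabled.
# entry_point, role, framework_version, and the S3 URI are assumptions for illustration.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",               # hypothetical training script
    role="<your-sagemaker-execution-role>",
    instance_type="ml.g5.12xlarge",       # 4x A10G GPUs
    instance_count=1,
    framework_version="2.0.0",            # assumed; use a version that supports torch_distributed
    py_version="py310",
    distribution={"torch_distributed": {"enabled": True}},
)
estimator.fit("s3://<your-bucket>/<training-data-prefix>")

# Training-script side (inside train.py): torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE,
# so the script only initializes the NCCL process group and wraps the model in DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).to(local_rank)   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])

With this setup, torchrun launches one process per GPU on the instance, so the training script itself does not need to spawn processes or hard-code the number of GPUs.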
