The goal of this project was to familiarize myself with PyTorch DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP) training. It has been tested with two GPUs on a single node.
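For orientation, here is a minimal sketch of a single-node, two-GPU DDP setup, assuming a `torchrun` launch; the linear layer is a placeholder, not this project's actual model.

```python
# Minimal DDP setup sketch for one node with two GPUs; assumes a launch like
#   torchrun --nproc_per_node=2 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])    # one process per GPU
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(768, 2).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])   # grads sync on backward

    # ... training loop; give each rank its own data shard via
    # torch.utils.data.distributed.DistributedSampler ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```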
Multiple Choice Question Answering: given a question, select the correct answer from a given set of possible answers.
- Models were trained on RoBERTa-BASE.
- The learning rate was 3e-6 (using RoBERTaMultipleChoice) and 1e-5 (using RoBERTaSequenceClassification), with the AdamW optimizer (a setup sketch follows this list).
- The number of epochs was 5.
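A minimal sketch of the corresponding model and optimizer setup; this assumes the names above refer to the Hugging Face `transformers` classes `RobertaForMultipleChoice` and `RobertaForSequenceClassification`, and `num_labels=2` for the classification variant is likewise an assumption (per-pair binary scoring).

```python
# Model/optimizer setup sketch, assuming the Hugging Face `transformers`
# classes are what the names above refer to.
from torch.optim import AdamW
from transformers import RobertaForMultipleChoice, RobertaForSequenceClassification

# Multiple-choice head: one forward pass scores all candidate answers jointly.
mc_model = RobertaForMultipleChoice.from_pretrained("roberta-base")
mc_optimizer = AdamW(mc_model.parameters(), lr=3e-6)

# Sequence-classification head: each (question, answer) pair is scored
# independently; num_labels=2 (binary) is an assumed configuration.
sc_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
sc_optimizer = AdamW(sc_model.parameters(), lr=1e-5)
```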
Results are reported for the checkpoint with the best validation loss across epochs:
- Accuracy using RoBERTaSequenceClassification: 85.60 %
- Accuracy using RoBERTaMultipleChoice: 83.63 %
Note: Due to compute limitations, the models were trained with an effective batch size of 8, and each epoch took about 150 minutes. Better results can therefore be expected from training for more epochs.
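Since the per-GPU batch size was capped well below 8 (see the list below), reaching an effective batch size of 8 suggests gradient accumulation; the factorization 2 GPUs x per-GPU batch 2 x 2 accumulation steps = 8 is an assumption inferred from the numbers in this README, but the pattern would look like this:

```python
# Gradient-accumulation sketch for an effective batch size of 8; the
# factorization (2 GPUs x per-GPU batch 2 x accum_steps 2) is assumed.
def train_epoch(ddp_model, optimizer, train_loader, accum_steps=2):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        loss = ddp_model(**batch).loss / accum_steps  # scale so summed grads average out
        loss.backward()  # DDP all-reduces grads here; ddp_model.no_sync() could
                         # skip the all-reduce on non-update steps for efficiency
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```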
- Maximum per-GPU batch size achievable with DDP: 2
- Maximum per-GPU batch size achievable with FSDP: 4 (see the wrapping sketch after this list)
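For comparison, a minimal FSDP wrapping sketch: sharding parameters, gradients, and optimizer state across the two GPUs is what frees enough memory for the larger per-GPU batch. The placeholder model and the size-based wrap policy are assumptions, and the process group is assumed to be initialized as in the DDP sketch above.

```python
# Minimal FSDP wrapping sketch; assumes dist.init_process_group has already
# run (as in the DDP sketch above). The transformer is a placeholder for
# RoBERTa-BASE, and the size-based wrap policy is an assumed choice.
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

model = torch.nn.Transformer(d_model=768).cuda()  # placeholder model
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)

# The optimizer must be built after wrapping, over the flattened shards.
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=3e-6)
```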
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (paper)
- Advanced Model Training with Fully Sharded Data Parallel (PyTorch tutorial)
- Getting Started with Fully Sharded Data Parallel (PyTorch tutorial)
- A Comprehensive Tutorial to Pytorch DistributedDataParallel (Medium)
- Getting Started with Distributed Data Parallel (PyTorch tutorial)
- PyTorch FSDP docs: https://pytorch.org/docs/stable/fsdp.html
- PyTorch distributed docs: https://pytorch.org/docs/stable/distributed.html
- Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering (paper)