fix(base_trainer): gather weights in save_pretrained under zero3 #429

Merged: 4 commits into main from fix-save-pretrained-zero3 on Apr 13, 2023

Conversation

@maxreciprocate (Collaborator) commented Apr 10, 2023

This PR lets users choose whether to save the optimizer state with each checkpoint and the final checkpoint, or to save only the model weights via trainer.save_pretrained (the new default). It also removes two back-to-back calls to trainer.evaluate, which would happen on the same iter_count whenever total_steps is divisible by config.train.eval_interval. Finally, it fixes trainer.save_pretrained so that under ZeRO-3 the sharded weights are gathered into the state_dict, which previously was not done.

https://wandb.ai/sorry/trlx-references/reports/fix-save-pretrained-zero3-v-main--Vmlldzo0MDI1ODk1
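For reference, here is a minimal sketch of the gather fix, assuming an Accelerate-wrapped trainer; attribute names such as `self.accelerator`, `self.model`, `self.tokenizer`, and `config.train.checkpoint_dir` are illustrative rather than the exact trlx code:

```python
import os
from typing import Optional


def save_pretrained(self, directory: Optional[str] = None, **kwargs):
    """Save only the underlying Hugging Face model weights and config.

    Under DeepSpeed ZeRO-3 the parameters are sharded across ranks, so the
    full state_dict has to be gathered before calling the wrapped model's
    own save_pretrained; otherwise the saved files contain empty shards.
    """
    if directory is None:
        directory = os.path.join(self.config.train.checkpoint_dir, "hf_model")

    self.accelerator.wait_for_everyone()
    # Consolidates the ZeRO-3 shards into a single full state_dict on the main process
    state_dict = self.accelerator.get_state_dict(self.model)
    self.accelerator.unwrap_model(self.model).save_pretrained(
        directory,
        save_function=self.accelerator.save,
        is_main_process=self.accelerator.is_main_process,
        state_dict=state_dict,
        **kwargs,
    )
    if self.accelerator.is_main_process:
        self.tokenizer.save_pretrained(directory)
```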

@maxreciprocate requested a review from jon-tow on April 10, 2023 16:18
@jon-tow (Collaborator) left a comment


Thanks for the gather fix and for catching those final step conditions (self.iter_count >= self.total_steps)!

One issue with making save_optimizer = False the default is that you can't resume training (even though proper checkpointing from, say, DeepSpeed ZeRO states is currently WIP). Luckily, we haven't trained on datasets large enough (> 10M samples) for this to be much of an issue. Do you foresee it becoming one? I sort of feel like the expected behavior is to save optimizer states.
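For reference, a rough sketch of how a save_optimizer flag could gate checkpointing; only the flag name comes from this PR, while the surrounding method and config fields are illustrative:

```python
def save(self, directory=None):
    """Save a training checkpoint.

    With save_optimizer enabled, accelerator.save_state writes the full
    state (model, optimizer, scheduler, RNG) so the run can be resumed;
    otherwise only the model weights are exported via save_pretrained.
    """
    dst = directory or self.config.train.checkpoint_dir
    if self.config.train.save_optimizer:
        # Full resumable checkpoint (includes DeepSpeed/ZeRO optimizer shards)
        self.accelerator.save_state(dst)
    else:
        # Weights-only export: smaller on disk, but training cannot resume from it
        self.save_pretrained(dst)
```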

@maxreciprocate (Collaborator, Author) replied

Yeah, I also think the expected behaviour is to save optimizer states, but as you said, typical relevant datasets are well below 10M samples, and resuming training only matters under heavy preemption (at least I haven't needed to resume training in a while). Pros of the weights-only default: training runs or sweeps would take up only a fraction of the storage (plausibly helpful for end-users with small drives who run many different experiments). But now that CoreWeave quotas seem to be a bit more relaxed (I used to delete .cache on the HDD twice a day), I will invert the default as you suggested.

@jon-tow (Collaborator) left a comment


Thanks, @reciprocated!

@jon-tow merged commit 2318d04 into main on Apr 13, 2023
@maxreciprocate deleted the fix-save-pretrained-zero3 branch on April 13, 2023 20:26