Improve PPO readability #210

Merged: 12 commits merged into CarperAI:main on Feb 5, 2023
Conversation

@alan-cooney (Contributor) commented on Jan 22, 2023:

Mostly adds comments and renames a few variables to be more descriptive; also deletes two unused methods.

I've run an example with wandb to check that this doesn't change anything unexpectedly - https://wandb.ai/alancooney/trlx/runs/aew4nu69 (the results are the same as before the changes).

Co-authored-by: @jezgillen

@alan-cooney alan-cooney changed the title Add comments to PPO code Improve PPO readability Jan 22, 2023
@alan-cooney (Contributor, Author) commented:

@reciprocated, to help with documenting this code, are you able to confirm what for _ in range(self.n_updates_per_batch): is intended to do in the base trainer?

for _ in range(self.config.train.epochs):
    for batch in self.train_dataloader:
        for _ in range(self.n_updates_per_batch):
            forward_time = time()

Git blame suggests this is your code. My understanding is that it runs the same batch multiple times - is that expected? That is, it lets the user repeat a batch, rather than, for example, taking multiple steps (e.g. turns in a text-based game, which I don't think the library supports).

@maxreciprocate (Collaborator) commented:
Hey there! n_updates_per_batch lets us perform multiple gradient updates on the same batch of data; this is a common PPO implementation detail. Here is an excerpt from the PPO paper's abstract [1] for reference:

Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates

[1] https://arxiv.org/abs/1707.06347
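
For readers skimming this thread, here is a minimal, hypothetical sketch of that pattern in plain PyTorch. The policy, ppo_loss, and batch below are stand-ins rather than trlx's actual classes; only the inner loop reflects the n_updates_per_batch idea discussed above.

import torch

# Hypothetical stand-ins: a tiny policy and a placeholder surrogate loss.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
n_updates_per_batch = 4

def ppo_loss(policy, batch):
    # Placeholder for the clipped surrogate objective; a real implementation
    # would use stored log-probs, advantages, and a clip ratio.
    observations, advantages = batch
    logits = policy(observations)
    return -(logits.log_softmax(-1).max(-1).values * advantages).mean()

batch = (torch.randn(8, 4), torch.randn(8))  # one batch of collected rollouts

# The PPO detail in question: several gradient updates on the same batch,
# rather than a single update per data sample.
for _ in range(n_updates_per_batch):
    optimizer.zero_grad()
    loss = ppo_loss(policy, batch)
    loss.backward()
    optimizer.step()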

@alan-cooney (Contributor, Author) commented:

Awesome, thanks for the quick reply! I'll add this into the comments.

@alan-cooney alan-cooney marked this pull request as ready for review January 23, 2023 09:56
@alan-cooney alan-cooney marked this pull request as draft January 23, 2023 19:15
@alan-cooney alan-cooney marked this pull request as ready for review January 23, 2023 20:34
@maxreciprocate (Collaborator) left a review:

Splendid! What's left is to merge the changes from main and do a pre-commit run. I've also left a small suggestion to keep the naming consistent throughout. Thanks for the effort @alan-cooney @jezgillen.

@LouisCastricato (Contributor) commented:

Let's get this merged

@alan-cooney (Contributor, Author) commented:

Thanks for the review @reciprocated

Note that I've had to resolve a tonne of merge conflicts, so it's worth checking that those are in line with expectations (they're primarily with your commits).

Given the size of this PR, please take a look and merge as soon as possible; otherwise we'll likely get more merge conflicts.

@maxreciprocate (Collaborator) left a review.

@maxreciprocate maxreciprocate merged commit c8aeb0e into CarperAI:main Feb 5, 2023
@alan-cooney alan-cooney deleted the commentsCore branch February 5, 2023 18:19
@alan-cooney (Contributor, Author) commented:

Thanks for the quick review!
