Improve PPO readability #210

Merged: 12 commits merged into CarperAI:main on Feb 5, 2023
Conversation

@alan-cooney (Contributor) commented on Jan 22, 2023:

Mostly adds comments and renames a few variables to be more descriptive; also deletes two unused methods.

I've run an example with wandb to check that this doesn't change anything unexpectedly - https://wandb.ai/alancooney/trlx/runs/aew4nu69 (the results are the same as before the changes).

Co-authored-by: @jezgillen

@alan-cooney alan-cooney changed the title Add comments to PPO code Improve PPO readability Jan 22, 2023
@alan-cooney (Contributor, Author) commented:

@reciprocated, to help with documenting this code, are you able to confirm what for _ in range(self.n_updates_per_batch): is intended to do in the base trainer?

for _ in range(self.config.train.epochs):
    for batch in self.train_dataloader:
        for _ in range(self.n_updates_per_batch):
            forward_time = time()

Git blame suggests this is your code. My understanding is that it runs the same batch multiple times - is that expected? That is, it lets the user repeat a batch, rather than, for example, taking multiple steps (e.g. turns in a text-based game, which I don't think the library supports).

@maxreciprocate (Collaborator) commented:
Hey there! n_updates_per_batch lets us perform multiple gradient updates on the same batch of data; this is a common PPO implementation detail. Here is an excerpt from the PPO paper's abstract [1] for reference:

Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates

[1] https://arxiv.org/abs/1707.06347
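
For readers skimming this thread, here is a minimal, hypothetical sketch of that pattern in plain PyTorch. The policy, ppo_loss, and batch below are stand-ins rather than trlx's actual classes; only the inner loop reflects the n_updates_per_batch idea discussed above.

import torch

# Hypothetical stand-ins: a tiny policy and a placeholder surrogate loss.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
n_updates_per_batch = 4

def ppo_loss(policy, batch):
    # Placeholder for the clipped surrogate objective; a real implementation
    # would use stored log-probs, advantages, and a clip ratio.
    observations, advantages = batch
    logits = policy(observations)
    return -(logits.log_softmax(-1).max(-1).values * advantages).mean()

batch = (torch.randn(8, 4), torch.randn(8))  # one batch of collected rollouts

# The PPO detail in question: several gradient updates on the same batch,
# rather than a single update per data sample.
for _ in range(n_updates_per_batch):
    optimizer.zero_grad()
    loss = ppo_loss(policy, batch)
    loss.backward()
    optimizer.step()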

@alan-cooney (Contributor, Author) commented:

Awesome, thanks for the quick reply! I'll add this into the comments.

@alan-cooney alan-cooney marked this pull request as ready for review January 23, 2023 09:56
@alan-cooney alan-cooney marked this pull request as draft January 23, 2023 19:15
@alan-cooney alan-cooney marked this pull request as ready for review January 23, 2023 20:34
@maxreciprocate (Collaborator) left a review:

Splendid! What's left is to merge the changes from main and do a pre-commit run. I've also left a small suggestion to keep the naming consistent throughout. Thanks for the effort @alan-cooney @jezgillen.

@LouisCastricato (Contributor) commented:

Let's get this merged

@alan-cooney (Contributor, Author) commented:

Thanks for the review @reciprocated

Note that I've had to resolve a tonne of merge conflicts, so it's worth checking that those are in line with expectations (they're primarily with your commits).

Given the size of this PR, please take a look and merge as soon as possible; otherwise we'll likely get more merge conflicts.

@maxreciprocate (Collaborator) left a review.

@maxreciprocate maxreciprocate merged commit c8aeb0e into CarperAI:main Feb 5, 2023
@alan-cooney alan-cooney deleted the commentsCore branch February 5, 2023 18:19
@alan-cooney (Contributor, Author) commented:

Thanks for the quick review!
