
fix(ppo_trainer): compute mean KL sequence-wise #441

Merged 4 commits into main on Apr 20, 2023
Conversation

@maxreciprocate (Collaborator) commented Apr 19, 2023

This PR fixes #438, specifically:

  • Mean KL, which is used for statistics and the AdaptiveKLController, is now calculated sequence-wise rather than token-wise (see the sketch below)
  • Statistics are now averaged across rollouts, instead of being taken only from the last one
  • exp_* logging variables are renamed to rollout_*, so they are not confused with sqrt_*

https://wandb.ai/sorry/trlx/reports/Sequence-wise-v-token-wise-mean-KL--Vmlldzo0MTE0NzUy

https://wandb.ai/sorry/trlx-references/reports/fix-kl-computation-v-main--Vmlldzo0MTE3Nzc4
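
For context, a minimal PyTorch sketch (tensor names and shapes here are illustrative, not the exact trainer code) contrasting the old token-wise mean with the new sequence-wise mean of the approximate KL:

```python
import torch

# Hypothetical batch of per-token log-probs, shape (batch, response_len)
logprobs = torch.randn(4, 16)
ref_logprobs = torch.randn(4, 16)
attention_mask = torch.ones(4, 16)

log_ratio = (logprobs - ref_logprobs) * attention_mask
kl_per_token = log_ratio.exp() - 1 - log_ratio

token_wise_kl = kl_per_token.mean()            # old: averages over batch and tokens
sequence_wise_kl = kl_per_token.sum(1).mean()  # new: sum per sequence, then average over the batch
print(token_wise_kl.item(), sequence_wise_kl.item())
```

The sequence-wise value grows with response length, which is the scale that the lm-human-preferences target KL (discussed below) assumes.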

@maxreciprocate requested a review from Dahoas on April 19, 2023 16:40
@@ -435,7 +435,7 @@ def make_experience(self, num_rollouts: int = 1024, iter_count: int = 0): # noq
start = prompt_tensors.shape[1] - 1

log_ratio = (logprobs - ref_logprobs) * attention_mask[:, :-1]
- self.mean_kl = (log_ratio.exp() - 1 - log_ratio).mean().to(device)
+ mean_kl = (log_ratio.exp() - 1 - log_ratio).sum(1).mean().to(device)
Collaborator
Thanks for fixing this. Do you think it makes sense to add an option to the config allowing us to choose between token-wise and sequence-wise KL? I agree that having a KL computation invariant to sequence length is good to keep around.

Collaborator Author
It's easy enough to log both, but for `AdaptiveKLController`'s purposes I would stick with one variant, the sequence-wise one as in lm-human-preferences, since the targets are already copied from there and the respective paper (Ziegler et al., 2019) also uses a target KL value of 8:
https://github.com/openai/lm-human-preferences/blob/ec727fde10f1eafb3177e9b0f41a42142e95a2fd/launch.py#L-129-L131
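
For reference, the controller that consumes this target follows the proportional update from Ziegler et al. (2019); a rough paraphrase of the lm-human-preferences logic (not the exact trlx class) looks like this:

```python
class AdaptiveKLController:
    """Paraphrased adaptive KL coefficient controller, as in lm-human-preferences."""

    def __init__(self, init_kl_coef: float, target: float, horizon: int):
        self.value = init_kl_coef  # current KL penalty coefficient
        self.target = target       # e.g. 8 nats of sequence-wise KL
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> None:
        # Increase the coefficient when the measured KL overshoots the target,
        # decrease it when it undershoots; the error is clipped for stability.
        proportional_error = max(min(current_kl / self.target - 1, 0.2), -0.2)
        self.value *= 1 + proportional_error * n_steps / self.horizon
```

Feeding it a token-wise mean KL would leave `current_kl` far below a target of 8, so by the update rule above the coefficient would keep shrinking.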

@@ -470,18 +470,21 @@ def make_experience(self, num_rollouts: int = 1024, iter_count: int = 0): # noq
)

rollout_count += 1
exp_time = clock.tick()

if torch.distributed.is_initialized():
Collaborator
Nice

torch.distributed.all_reduce(self.mean_kl, torch.distributed.ReduceOp.AVG)

stats["policy/sqrt_kl"] = torch.sqrt(self.mean_kl).item()
stats = {k: sum([xs[k] for xs in accumulated_stats]) / len(accumulated_stats) for k in stats}
Collaborator
Strictly speaking this isn't necessarily correct, e.g. if we are recording the max over all local rollouts. However, I don't know how to perform the correct reduction without annotating each stat, so this seems fine for now.

Collaborator Author
There are no maxes/mins in this particular dict, but I see what you're saying. Also, I guess you're referring to

`stats = {key: sum([stats[key] for stats in stats_accum]) / self.num_mb for key in stats_accum[0]}`

but it's also pretty easy to fix, perhaps in a separate PR.
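
If it ever needs fixing, one hypothetical way to do it (the names below are made up for illustration) is to pick the reduction per key instead of always averaging:

```python
from typing import Dict, List


def reduce_stats(accumulated_stats: List[Dict[str, float]]) -> Dict[str, float]:
    """Reduce a list of per-rollout stat dicts, honoring max/min-style keys."""
    reduced = {}
    for key in accumulated_stats[0]:
        values = [stats[key] for stats in accumulated_stats]
        if key.endswith("_max"):
            reduced[key] = max(values)
        elif key.endswith("_min"):
            reduced[key] = min(values)
        else:
            reduced[key] = sum(values) / len(values)
    return reduced
```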

@Dahoas (Collaborator) left a comment

Looks good!

@Dahoas merged commit 6e655a4 into main on Apr 20, 2023
@lzy37ld commented Apr 21, 2023

Thanks for this great work! Could you explain a bit why this is the sequence-wise KL? I could not connect `kl = log_ratio.exp() - 1 - log_ratio` to the usual math formulation `sum_x p(x) * log(p(x)/q(x))`.
Gently ping @reciprocated

@maxreciprocate (Collaborator Author)

@lzy37ld It's sequence-wise KL because of the `.sum(1)` before taking the average, while the KL expression itself comes from this blog post http://joschu.net/blog/kl-approx.html and is commonly used, e.g. https://github.com/DLR-RM/stable-baselines3/blob/dc09d81f9c07943ddbeac57405d9ae2a31f4d434/stable_baselines3/ppo/ppo.py#L255
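
As a quick Monte-Carlo sanity check of the k3 estimator from that blog post (the distributions below are arbitrary and only for illustration): sampling x from q and setting r = p(x)/q(x), the quantity (r - 1) - log r is always non-negative and its mean approaches KL(q || p).

```python
import torch

q = torch.distributions.Normal(0.0, 1.0)  # sampling distribution
p = torch.distributions.Normal(0.3, 1.0)  # the other distribution

x = q.sample((200_000,))
log_r = p.log_prob(x) - q.log_prob(x)  # log p(x) - log q(x)
k3 = log_r.exp() - 1 - log_r           # same algebraic form as log_ratio.exp() - 1 - log_ratio

print(k3.mean().item())                                # Monte-Carlo estimate
print(torch.distributions.kl_divergence(q, p).item())  # closed form, ~0.045
```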

@maxreciprocate deleted the fix-kl-computation branch on April 21, 2023 11:28
@lzy37ld commented Apr 22, 2023

@reciprocated Thanks, that makes sense! It's really a fantastic implementation!
One more question: why do we subtract a reward(x, y_original) in the reward_fn? When I looked at the paper, I noticed that they only focus on R(x, y_generated).

@maxreciprocate (Collaborator Author)

@lzy37ld It's optional normalization (disabled by setting `delta_reward` to `False`):

`delta_reward = True`

which @PhungVanDuy has found to work better than passing the raw reward.
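
In other words, the reward used for training is roughly R(x, y_generated) - R(x, y_original); a hypothetical sketch of that idea (function and argument names are made up, not the exact trlx example code):

```python
def delta_rewards(prompts, generations, originals, score_fn):
    """Score generated continuations relative to the dataset's original continuations."""
    generated_scores = score_fn(prompts, generations)
    original_scores = score_fn(prompts, originals)
    return [g - o for g, o in zip(generated_scores, original_scores)]
```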

Successfully merging this pull request may close these issues: The mean_kl implementation is different from that of openai/lm-human-preference