
Reinforced Token Optimization (RTO)

This repository contains the source code for our paper DPO Meets PPO: Reinforced Token Optimization for RLHF.

TL;DR: Based on theoretical insights, we propose Reinforced Token Optimization (RTO), an RLHF algorithm that is more sample-efficient and effective than Proximal Policy Optimization (PPO). RTO outperforms PPO, Direct Preference Optimization (DPO), and other baselines on the AlpacaEval 2 and Arena-Hard benchmarks by a large margin.

Illustration of RTO

Model Releases and Evaluation Results

We release all model checkpoints in this Hugging Face repo, which includes the models evaluated below.

We use the UltraFeedback dataset. All preference learning uses a binarized version, while all reinforcement learning uses a prompt-only version.
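
As a minimal sketch of this data setup (not the repo's exact preprocessing, and assuming the widely used HuggingFaceH4/ultrafeedback_binarized release on the Hugging Face Hub), one could build a prompt-only view from the binarized preference data as follows:

from datasets import load_dataset

# Binarized UltraFeedback: (prompt, chosen, rejected) preference pairs.
# NOTE: the dataset path below is an assumption (a widely used public release),
# not necessarily the exact version used in this repo.
pref = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# Prompt-only view for the RL (PPO/RTO) stage: keep only the prompts.
prompts = pref.remove_columns([c for c in pref.column_names if c != "prompt"])
print(prompts.column_names)  # ['prompt']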

We evaluate these models on the popular AlpacaEval 2 and Arena-Hard benchmarks, and report the AlpacaEval 2 length-controlled (AE2 LC) and raw (AE2 WR) win rates together with the Arena-Hard style-controlled (AH SC) and raw (AH WR) scores in the following table.

Model    AE2 LC   AE2 WR   AH SC   AH WR
SFT       13.22     8.58     9.2     8.9
DPO       17.40    12.23    13.2    13.8
R-DPO     18.34    12.03    14.2    14.1
SimPO     25.46    20.20    14.5    15.2
TDPO      20.13    11.97    13.2    12.3
PPO       19.47    12.89    16.2    15.6
RTO       27.00    22.45    20.3    21.4

News

  • [2025.2.12] We updated our paper on arXiv.
  • [2025.2.7] We released our code and models.
  • [2024.4.29] We released our paper on arXiv.

Install Requirements

conda create -n rto python=3.10
conda activate rto
conda install cuda -c nvidia/label/cuda-12.1.0
pip3 install torch==2.4.1 torchvision torchaudio
cd RTO
pip3 install -e .

Training Scripts

We include the training scripts in examples/scripts.

bash examples/scripts/train_rto_llama_8b.sh

The script is configured for 8×A100 GPUs. You may adjust micro_rollout_batch_size and micro_train_batch_size based on your compute environment.

Hyperparameter Tuning

Reinforcement learning algorithms can be sensitive to hyperparameters. Starting from OpenRLHF's well-tuned PPO hyperparameters, the only additional parameter to tune is $\beta_1$ (dpo_reward_scale in the code), the scale of the DPO token rewards. Since the main contribution of the DPO rewards is reward shaping rather than absolute gains, $\beta_1$ can be safely set to a small value. We recommend $0.05$ as a starting point; the guideline is to keep the DPO token rewards from dominating.
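
For intuition only, here is a minimal sketch (with illustrative names, not the repository's implementation) of how a coefficient like dpo_reward_scale can blend token-level DPO rewards (scaled log-ratios between a DPO-trained policy and the reference policy) into the per-token rewards used by PPO, with the reward-model score credited to the last token:

import torch

def shaped_token_rewards(logp_dpo, logp_ref, rm_reward, dpo_reward_scale=0.05):
    # logp_dpo, logp_ref: (T,) log-probs of the generated tokens under the
    # DPO-trained policy and the reference policy; rm_reward: scalar score
    # from the reward model for the whole response.
    token_rewards = dpo_reward_scale * (logp_dpo - logp_ref)  # token-level DPO rewards
    token_rewards[-1] = token_rewards[-1] + rm_reward         # RM score on the last token
    return token_rewards

# Toy example: 5 generated tokens, reward-model score 1.3.
logp_dpo = torch.tensor([-1.2, -0.8, -2.0, -0.5, -1.0])
logp_ref = torch.tensor([-1.5, -1.0, -1.8, -0.7, -1.1])
print(shaped_token_rewards(logp_dpo, logp_ref, rm_reward=1.3))

With a small dpo_reward_scale, the shaping terms stay well below the reward-model score, matching the guideline above.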

Acknowledgement

We would like to thank OpenRLHF for their excellent implementation of RLHF algorithms.

Citation

If you find the content of this repo useful, please consider citing it as follows:

@article{zhong2024dpo,
  title={{DPO} Meets {PPO}: Reinforced Token Optimization for {RLHF}},
  author={Zhong, Han and Feng, Guhao and Xiong, Wei and Cheng, Xinle and Zhao, Li and He, Di and Bian, Jiang and Wang, Liwei},
  journal={arXiv preprint arXiv:2404.18922},
  year={2024}
}
