DPO Meets PPO: Reinforced Token Optimization for RLHF

Han Zhong; Zikang Shan; Guhao Feng; Wei Xiong; Xinle Cheng; Li Zhao; Di He; Jiang Bian; Liwei Wang

arXiv:2404.18922 · cs.LG · May 22, 2025

TL;DR

This paper introduces Reinforced Token Optimization (RTO), a new framework that models RLHF as an MDP to learn token-wise rewards, improving policy training for language models over traditional PPO methods.

Contribution

The paper proposes RTO, combining DPO and PPO, to learn token-wise rewards from preference data, enabling more efficient and fine-grained policy optimization in RLHF.
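
The idea of a token-wise reward learned from preference data can be illustrated with a minimal sketch. It assumes, as one plausible reading rather than the authors' confirmed implementation, that the per-token reward is the scaled log-probability ratio between a DPO-trained policy and the reference policy (the implicit reward used in DPO). The function names, the `beta` value, and the example log-probabilities are hypothetical.

```python
from typing import List


def token_rewards(dpo_logprobs: List[float],
                  ref_logprobs: List[float],
                  beta: float = 0.1) -> List[float]:
    """Dense reward sketch: beta * (log pi_dpo - log pi_ref) for each token.

    Assumption: both lists hold the log-probability of the same generated
    tokens under a DPO-trained policy and under the reference policy.
    """
    if len(dpo_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must have the same length")
    return [beta * (d - r) for d, r in zip(dpo_logprobs, ref_logprobs)]


def sparse_rewards(sentence_reward: float, length: int) -> List[float]:
    """Sentence-level baseline: the whole reward sits on the final token."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [0.0] * (length - 1) + [sentence_reward]


if __name__ == "__main__":
    # Hypothetical log-probabilities for a three-token response.
    dpo = [-1.0, -0.5, -2.0]
    ref = [-1.5, -0.5, -1.0]

    dense = token_rewards(dpo, ref, beta=0.1)
    sparse = sparse_rewards(sum(dense), len(dense))

    print("dense :", dense)
    print("sparse:", sparse)
```

Both reward vectors sum to the same sentence-level total; the dense version spreads the credit across tokens, which is the kind of signal a PPO-style policy update could then consume at every step instead of only at the end of the response.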

Findings

1. RTO outperforms PPO by 7.5 points on AlpacaEval 2.

2. RTO outperforms other preference learning algorithms.

3. Theoretically, RTO finds near-optimal policies sample-efficiently.

Abstract

In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards, a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of large language models, its open-source implementation is still largely sub-optimal. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Under this framework, we introduce an algorithm, Reinforced Token Optimization (RTO), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven to have the capability of finding the near-optimal policy…

Code & Models

Repositories

zkshan2002/rto (PyTorch, official)

Taxonomy

Topics: Real-Time Systems Scheduling · Real-time simulation and control systems

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization