Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

TL;DR
This study compares DPO and PPO for aligning large language models with human preferences, finding that PPO outperforms DPO across multiple RLHF benchmarks and achieves state-of-the-art results on code generation tasks.
Contribution
The paper provides a theoretical and empirical analysis of DPO, showing that it may have fundamental limitations, and a comprehensive examination of PPO that establishes its effectiveness in LLM alignment tasks.
Findings
PPO outperforms DPO in multiple RLHF benchmarks.
PPO achieves state-of-the-art results in code generation tasks.
Theoretical limitations of DPO are identified.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the…
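To make the reward-based versus reward-free distinction concrete, here is a minimal sketch of the two per-example objectives in their standard textbook form. It is not taken from this paper's code: the function names, the scalar inputs, and the default `beta=0.1` are illustrative assumptions. A reward-based pipeline first fits a reward model with a pairwise (Bradley-Terry) loss and then optimizes the policy with PPO, while DPO applies the same pairwise form directly to policy log-probabilities measured against a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss used to fit a reward model (reward-based RLHF).

    r_chosen / r_rejected are the scalar rewards assigned to the preferred
    and dispreferred responses for the same prompt.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Standard DPO loss for one preference pair (reward-free).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    print(reward_model_loss(2.0, 0.0))
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In this sketch the DPO loss equals log 2 whenever the policy matches the reference model, and it falls as the policy raises the chosen response's log-probability relative to the rejected one. The PPO stage of the reward-based pipeline is omitted here because it needs a full rollout and critic loop.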
Taxonomy
Topics: Digital Rights Management and Security
Methods: Direct Preference Optimization · Entropy Regularization · ALIGN · Proximal Policy Optimization
