Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu; Wei Fu; Jiaxuan Gao; Wenjie Ye; Weilin Liu; Zhiyu Mei; Guangju Wang; Chao Yu; Yi Wu

arXiv:2404.10719·cs.CL·October 11, 2024·6 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This study compares DPO and PPO in aligning large language models with human preferences, revealing PPO's superior performance across various benchmarks and its ability to achieve state-of-the-art results.

Contribution

The paper provides a comprehensive analysis of DPO and PPO, highlighting PPO's advantages and establishing its effectiveness in LLM alignment tasks.

Findings

01. PPO outperforms DPO in multiple RLHF benchmarks.

02. PPO achieves state-of-the-art results in code generation tasks.

03. Theoretical limitations of DPO are identified.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the…
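
To make the reward-free versus reward-based distinction concrete, here is a minimal sketch of the two per-sample objectives in their standard textbook forms. It is not taken from the paper's code: the function names `dpo_loss` and `ppo_clip_objective`, the scalar log-probability inputs, and the defaults `beta=0.1` and `clip_eps=0.2` are all illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference policy. No separate reward model is involved.
    """
    # Implicit reward margin: beta times the difference of policy/reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one action (hypothetical helper).

    The advantage is assumed to come from a learned reward model plus a
    critic, which is what makes this the reward-based route.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Policy identical to the reference: the DPO margin is 0 and the loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Probability ratio of 2 with a positive advantage is capped at 1 + clip_eps.
    print(ppo_clip_objective(math.log(2.0), 0.0, 1.0))
```

In this sketch DPO needs only log-probabilities from the policy and the reference model, while the PPO term depends on an advantage estimate, so a reward model and critic must be trained first.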

Code & Models

Repositories

openpsi-project/realhf
PyTorch · Official

Datasets

anakin87/gemma-vs-gemma-preferences
dataset · 11 dl

Taxonomy

Topics: Digital Rights Management and Security

Methods: Direct Preference Optimization · Entropy Regularization · ALIGN · Proximal Policy Optimization