Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms
Xuerui Su, Yue Wang, Jinhua Zhu, Mingyang Yi, Feng Xu, Zhiming Ma, Yuting Liu

TL;DR
This paper clarifies the relationship between Direct Preference Optimization (DPO) and Reinforcement Learning (RL) algorithms, especially in the context of Large Language Models, by analyzing their loss functions, target distributions, and key components.
Contribution
The paper introduces a unified framework, UDRRA, that connects DPO and RLHF algorithms and reveals their similarities, differences, and convergence properties.
Findings
UDRRA framework unifies DPO and RLHF algorithms
DPO's target distribution is clarified within the framework
Key components influence DPO's convergence rate
Abstract
With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatch between DPO and PPO, such as DPO's use of a classification loss driven by human-preferred data, has raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To address these ambiguities, we focus on three key aspects related to DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution at which the algorithm converges; (3) the…
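The "classification loss driven by human-preferred data" mentioned in the abstract can be illustrated with the widely used form of the DPO objective for a single preference pair. The sketch below is not taken from this paper and does not implement UDRRA; it assumes the sequence log-probabilities under the policy and the frozen reference model are already computed, and the function name, argument names, and the default `beta` are hypothetical choices made for illustration.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are scalar sequence log-probabilities (assumed precomputed).
    beta is the usual KL-strength coefficient; 0.1 is only an illustrative default.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # Binary classification loss: -log(sigmoid(margin)), written in a
    # numerically stable form for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's log-probability of the chosen response relative to the rejected one lowers the loss. No explicit reward or value model appears anywhere, which is the contrast with PPO that the abstract draws.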
Taxonomy
Topics: Fuzzy Logic and Control Systems · Neural Networks and Applications · AI-based Problem Solving and Planning
