Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms

Xuerui Su; Yue Wang; Jinhua Zhu; Mingyang Yi; Feng Xu; Zhiming Ma; Yuting Liu

arXiv:2502.03095·cs.LG·February 6, 2025

TL;DR

This paper clarifies the relationship between Direct Preference Optimization (DPO) and Reinforcement Learning (RL) algorithms, especially in the context of Large Language Models, by analyzing their loss functions, target distributions, and key components.

Contribution

The paper introduces a unified framework, UDRRA, that connects DPO and RLHF algorithms and reveals their similarities, differences, and convergence properties.

Findings

01 UDRRA framework unifies DPO and RLHF algorithms

02 DPO's target distribution is clarified within the framework

03 Key components influence DPO's convergence rate

Abstract

With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatch between DPO and PPO, such as DPO's use of a classification loss driven by human-preferred data, has raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To address these ambiguities, we focus on three key aspects related to DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution at which the algorithm converges; (3) the…
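For orientation, the "classification loss driven by human-preferred data" mentioned in the abstract can be sketched with the standard per-pair DPO objective, i.e. the negative log-sigmoid of a scaled difference of policy-to-reference log-ratios. This is a minimal sketch of that widely used form, not the paper's UDRRA framework; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. `beta` is a hypothetical
    default for the temperature that scales the implicit reward.
    """
    # Implicit rewards are the log-ratios between policy and reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss has the shape of a binary logistic classifier over response pairs, with no separately trained reward or value model; that is the feature the abstract contrasts with actor-critic PPO.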

Taxonomy

Topics: Fuzzy Logic and Control Systems · Neural Networks and Applications · AI-based Problem Solving and Planning