From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

TL;DR
This paper reveals that language models trained with DPO can be understood as Q-functions within a token-level MDP framework, providing new theoretical insights and practical improvements for reinforcement learning from human feedback.
Contribution
The paper derives DPO as an inverse Q-learning algorithm in the token-level MDP, reconciling DPO's bandit derivation with the token-level formulation used by standard RLHF, and draws three empirical insights from that result.
Findings
DPO performs token-level credit assignment.
Under the token-level formulation, search-based algorithms such as MCTS are equivalent to likelihood-based search on a DPO policy.
The choice of reference policy can cause implicit rewards to decline during training.
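To make the credit-assignment finding concrete, here is a minimal sketch of how per-token implicit rewards could be read off a DPO-trained model. It assumes the per-token quantity is beta times the log-ratio between the policy and the reference policy for each generated token. The function name `per_token_implicit_rewards`, the value of `beta`, and all log-probabilities below are hypothetical illustrations, not values or code from the paper.

```python
def per_token_implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """Return beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)) for each token.

    Both inputs are per-token log-probabilities of the SAME response,
    one list from the DPO policy and one from the reference policy.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


# Made-up numbers for a four-token response (illustration only).
tokens = ["The", "capital", "is", "Rome"]
policy_logps = [-1.0, -2.0, -0.5, -4.0]
ref_logps = [-1.0, -2.1, -0.5, -1.5]

rewards = per_token_implicit_rewards(policy_logps, ref_logps, beta=0.1)

# The token whose probability dropped most relative to the reference
# receives the lowest implicit reward.
worst_index = min(range(len(tokens)), key=lambda i: rewards[i])
worst_token = tokens[worst_index]

for tok, r in zip(tokens, rewards):
    print(f"{tok:>8s}  {r:+.3f}")
```

In this toy case the last token carries almost all of the negative reward, which is the kind of localized signal the credit-assignment finding refers to.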
Abstract
Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show…
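As a rough illustration of why the bandit and token-level views can be reconciled, the sketch below computes the standard DPO loss for one preference pair from per-token log-probabilities. The response-level log-ratio used in the bandit view is just the sum of the token-level log-ratios. The function names, `beta`, and the example numbers are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def sequence_log_ratio(policy_logps, ref_logps):
    """Sum of per-token log-ratios, i.e. log pi(y|x) - log pi_ref(y|x)."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def dpo_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref, beta=0.1):
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
    margin = beta * (
        sequence_log_ratio(chosen_policy, chosen_ref)
        - sequence_log_ratio(rejected_policy, rejected_ref)
    )
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical per-token log-probabilities for a chosen and a rejected response.
loss_untrained = dpo_loss([-1.0, -2.0], [-1.0, -2.0], [-1.5, -0.5], [-1.5, -0.5])
loss_improved = dpo_loss([-0.5, -1.0], [-1.0, -2.0], [-2.5, -1.5], [-1.5, -0.5])
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the chosen response's token probabilities and lowering the rejected one's reduces it.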
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
Methods: Direct Preference Optimization · Balanced Selection · Q-Learning
