Explicit Preference Optimization: No Need for an Implicit Reward Model
Xiangkun Hu, Lemin Kong, Tong He, David Wipf

TL;DR
This paper introduces EXPO, an explicit preference optimization framework for training large language models that avoids the pitfalls of implicit reward reparameterizations used in prior methods like DPO, leading to more transparent and effective preference alignment.
Contribution
The paper proposes EXPO, a novel explicit preference optimization method that eliminates the need for reparameterized implicit rewards, addressing limitations of existing DPO-based approaches.
Findings
EXPO outperforms DPO in preference alignment tasks.
EXPO's regularization is more transparent than that of implicit-reward methods such as DPO, and it avoids their counter-intuitive behaviors.
Empirical results validate the theoretical advantages of EXPO over implicit reward methods.
Abstract
The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and…
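For readers unfamiliar with the "implicit reward" the abstract refers to, the following is a minimal sketch of the standard per-example DPO loss, the baseline that EXPO is positioned against. It is not the EXPO objective, which this excerpt does not state. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed token log-probabilities of the preferred and dispreferred responses under the trained policy and a frozen reference policy.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l: policy log-probabilities of the preferred (w) and
    dispreferred (l) responses. ref_logp_*: the same quantities under the
    frozen reference policy. beta is a hypothetical default.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    margin = reward_w - reward_l

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-1.5, -1.5, -1.5, -1.5)

# Raising the preferred response's log-ratio above the dispreferred one's
# yields a positive margin and a smaller loss.
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

Note that no separate reward model appears anywhere: the reward is defined entirely by the policy-to-reference log-ratio, which is the reparameterization the paper argues against.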
Taxonomy
Topics: Emotion and Mood Recognition · Topic Modeling · Recommender Systems and Techniques
Methods: Direct Preference Optimization
