On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
Yong Lin, Skyler Seto, Maartje ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, Tong Zhang

TL;DR
This paper investigates the generalization limitations of the implicit reward model derived from Direct Preference Optimization in reinforcement learning from human feedback, revealing its poorer out-of-distribution performance compared to explicit reward models.
Contribution
The paper provides empirical evidence that DPORM, the implicit reward model induced by DPO, has limited generalization ability, especially under distribution shifts. This supports the use of explicit reward models in iterative DPO methods.
Findings
DPORM fits the training data well but generalizes less effectively than an explicit reward model.
DPORM's accuracy drops by 3-7% in out-of-distribution settings.
Explicit reward models outperform implicit models under distribution shifts.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM's effectiveness directly implies the optimality of the learned policy, and also has practical implications for LLM alignment methods including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and…
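To make the evaluated quantity concrete, the following is a minimal sketch (not the authors' code) of how an implicit DPO reward can score a preference pair. It relies on the standard DPO result that the implicit reward equals beta times the log-ratio of policy to reference likelihood, plus a term that depends only on the prompt and therefore cancels when two responses to the same prompt are compared. The function names, the value of beta, and the log-probabilities below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit DPO reward, up to a prompt-only term.

    logp_policy and logp_ref are the summed token log-probabilities of the
    same response under the trained policy and the frozen reference model.
    """
    return beta * (logp_policy - logp_ref)


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs where the preferred response gets the higher reward."""
    if len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must have the same length")
    if not chosen:
        raise ValueError("at least one preference pair is required")
    correct = sum(c > r for c, r in zip(chosen, rejected))
    return correct / len(chosen)


# Hypothetical sequence log-probabilities for three preference pairs.
policy_chosen = [-10.0, -12.0, -8.0]
ref_chosen = [-11.0, -12.5, -7.0]
policy_rejected = [-13.0, -11.0, -9.0]
ref_rejected = [-12.0, -11.0, -9.5]

chosen_rewards = [implicit_reward(p, r) for p, r in zip(policy_chosen, ref_chosen)]
rejected_rewards = [implicit_reward(p, r) for p, r in zip(policy_rejected, ref_rejected)]

accuracy = pairwise_accuracy(chosen_rewards, rejected_rewards)
print(f"pairwise accuracy: {accuracy:.3f}")
```

An explicit reward model would be scored with the same `pairwise_accuracy` function, with its scalar outputs taking the place of the log-ratio rewards, which is what makes the two kinds of model directly comparable on held-out preference data.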
Taxonomy
Topics: Multi-Criteria Decision Making
Methods: Direct Preference Optimization
