On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Yong Lin; Skyler Seto; Maartje ter Hoeve; Katherine Metcalf; Barry-John Theobald; Xuan Wang; Yizhe Zhang; Chen Huang; Tong Zhang

arXiv:2409.03650·cs.LG·October 4, 2024



TL;DR

This paper investigates the generalization limitations of the implicit reward model derived from Direct Preference Optimization in reinforcement learning from human feedback, revealing its poorer out-of-distribution performance compared to explicit reward models.

Contribution

The paper provides empirical evidence that the implicit reward model of DPO (DPORM) generalizes poorly, especially under distribution shifts, which supports using explicit reward models in iterative DPO methods.

Findings

01 DPORM fits training data well but generalizes less effectively.

02 DPORM's accuracy drops by 3-7% in out-of-distribution settings.

03 Explicit reward models outperform implicit models under distribution shifts.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM's effectiveness directly implies the optimality of the learned policy, and also has practical implications for LLM alignment methods, including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and…
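As an illustration of the quantity the abstract refers to, the sketch below scores preference pairs with the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and reports the fraction of pairs in which the preferred answer receives the higher score. The prompt-only normalizing term of the implicit reward is omitted because it cancels when two answers to the same prompt are compared. The function names, the value beta = 0.1, and the log-probabilities are all hypothetical and chosen only for illustration; they are not the paper's code or data.

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward, up to a prompt-only constant.

    logp_policy and logp_ref are the summed token log-probabilities of one
    answer under the DPO-trained policy and the reference model.
    beta = 0.1 is an assumed value, not one taken from the paper.
    """
    return beta * (logp_policy - logp_ref)


def pairwise_accuracy(reward_pairs) -> float:
    """Fraction of (chosen_reward, rejected_reward) pairs ranked correctly."""
    pairs = list(reward_pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Made-up log-probabilities for four preference pairs, in the order:
# (policy logp chosen, ref logp chosen, policy logp rejected, ref logp rejected)
toy_logps = [
    (-10.0, -12.0, -11.0, -10.5),  # chosen margin +2.0 vs rejected -0.5
    (-8.0, -8.5, -7.0, -9.0),      # chosen margin +0.5 vs rejected +2.0
    (-15.0, -15.0, -14.0, -13.0),  # chosen margin  0.0 vs rejected -1.0
    (-9.0, -11.0, -9.5, -10.0),    # chosen margin +2.0 vs rejected +0.5
]

toy_pairs = [
    (implicit_reward(pc, rc), implicit_reward(pr, rr))
    for pc, rc, pr, rr in toy_logps
]

print(pairwise_accuracy(toy_pairs))  # 3 of the 4 toy pairs are ranked correctly
```

An explicit reward model would be evaluated with the same `pairwise_accuracy` helper, with its scalar outputs substituted for the implicit rewards, so the two kinds of model can be compared on identical preference pairs.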


Taxonomy

Topics: Multi-Criteria Decision Making

Methods: Direct Preference Optimization