Robust Preference Optimization through Reward Model Distillation

Adam Fisch; Jacob Eisenstein; Vicky Zayats; Alekh Agarwal; Ahmad Beirami; Chirag Nagpal; Pete Shaw; Jonathan Berant

arXiv:2405.19316·cs.LG·March 4, 2025·1 cites

Open Access

TL;DR

This paper introduces a distillation-based approach to improve preference optimization in language models, addressing overfitting issues and enhancing robustness to distribution shifts in preference data.

Contribution

The paper proposes a distillation method that trains the policy so that its implicit reward matches explicit reward models, which improves robustness and reduces overfitting in preference optimization.

Findings

01. Distillation from a family of reward models improves robustness to distribution shift.

02. The method preserves the simplicity of direct preference optimization.

03. Enhanced stability reduces degenerate policies and overfitting.

Abstract

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, empirical evidence suggests that DPO typically assigns implicit rewards that overfit and trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM such that its induced implicit reward, i.e., the scaled log-likelihood ratio of the model to the reference model, matches an explicit reward…
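
The implicit reward described in the abstract, and the idea of matching it to an explicit reward, can be illustrated with a minimal sketch. The abstract is truncated on this page, so the details here are assumptions for illustration: the function names are hypothetical, the inputs are scalar sequence log-probabilities, and the squared-error objective on reward differences for a pair of generations is one plausible way to make the implicit reward "match" an explicit one. The standard DPO loss is included only for contrast.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Scaled log-likelihood ratio: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Equal to -log(sigmoid(margin)), computed as a numerically stable softplus.
    The loss keeps shrinking as the margin grows, so nothing bounds the
    magnitude of the implicit rewards.
    """
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def distillation_loss(logp_1, logp_2, ref_logp_1, ref_logp_2,
                      reward_1, reward_2, beta):
    """Assumed squared-error objective for one pair of generations.

    Penalizes the gap between the explicit reward difference (from a
    hypothetical reward model) and the implicit reward difference.
    """
    implicit_gap = (implicit_reward(logp_1, ref_logp_1, beta)
                    - implicit_reward(logp_2, ref_logp_2, beta))
    explicit_gap = reward_1 - reward_2
    return (explicit_gap - implicit_gap) ** 2


if __name__ == "__main__":
    # Implicit rewards are 0.5 and -0.5, so the implicit gap is 1.0,
    # which equals the explicit gap 2.0 - 1.0.
    loss = distillation_loss(-1.0, -3.0, -2.0, -2.0, 2.0, 1.0, beta=0.5)
    print(loss)
```

Unlike the DPO loss, which decreases monotonically with the margin, the sketched objective is minimized at a finite target set by the explicit rewards, so pushing the implicit rewards beyond that target is penalized.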

Taxonomy

Topics: Technology and Data Analysis · Multi-Criteria Decision Making · Data Management and Algorithms

Methods: Direct Preference Optimization