Robust Preference Optimization through Reward Model Distillation
Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

TL;DR
This paper introduces a distillation-based approach to improve preference optimization in language models, addressing overfitting issues and enhancing robustness to distribution shifts in preference data.
Contribution
It proposes a novel distillation method that aligns implicit rewards with explicit reward models, improving robustness and avoiding overfitting in preference optimization.
Findings
Distillation from a family of reward models improves robustness to distribution shift.
The method preserves the simplicity of direct preference optimization.
Enhanced stability reduces degenerate policies and overfitting.
Abstract
Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, the empirical evidence suggests that DPO typically assigns implicit rewards that overfit, and trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM such that its induced implicit reward, i.e., the scaled log-likelihood ratio of the model to the reference model, matches an explicit reward…
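The abstract defines the implicit reward as the scaled log-likelihood ratio of the model to the reference model, and describes training it to match an explicit reward. The sketch below illustrates that idea on scalar log-probabilities. The abstract is cut off before the objective is stated, so the squared-error form of `distillation_loss` is an assumption made for illustration, as are all function and parameter names (`beta`, `logp_pol_*`, `logp_ref_*`, `r1`, `r2`). The DPO loss is included only for contrast.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta):
    """Scaled log-likelihood ratio: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_pol_w, logp_pol_l, logp_ref_w, logp_ref_l, beta):
    """Standard DPO loss, -log sigmoid(margin), for one (preferred, rejected) pair."""
    margin = (implicit_reward(logp_pol_w, logp_ref_w, beta)
              - implicit_reward(logp_pol_l, logp_ref_l, beta))
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def distillation_loss(logp_pol_1, logp_pol_2, logp_ref_1, logp_ref_2,
                      r1, r2, beta):
    """Assumed objective: squared gap between explicit and implicit reward differences.

    r1 and r2 are scores from a hypothetical explicit reward model for the
    two generations; the exact loss used in the paper is not given in the
    truncated abstract.
    """
    implicit_gap = (implicit_reward(logp_pol_1, logp_ref_1, beta)
                    - implicit_reward(logp_pol_2, logp_ref_2, beta))
    explicit_gap = r1 - r2
    return (explicit_gap - implicit_gap) ** 2
```

Under these assumptions the contrast with DPO is visible directly: `dpo_loss` keeps decreasing as the implicit reward margin grows, so nothing stops the margin from diverging, whereas `distillation_loss` reaches zero at a finite margin equal to the explicit reward difference.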
Taxonomy
Topics: Technology and Data Analysis · Multi-Criteria Decision Making · Data Management and Algorithms
Methods: Direct Preference Optimization
