Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy

TL;DR
This paper introduces DRDO, a novel method that simultaneously distills rewards and learns preferences for language models, improving robustness and performance over existing methods like DPO, especially under noisy or out-of-distribution (OOD) conditions.
Contribution
DRDO is the first approach to jointly model rewards and preferences, addressing degeneracy issues and enhancing robustness in language model alignment.
Findings
DRDO outperforms DPO and e-DPO in expected rewards.
DRDO is more robust to noisy preference signals.
DRDO maintains performance in out-of-distribution scenarios.
Abstract
Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular due to its simplicity, DPO and similar direct alignment methods which rely heavily on the Bradley-Terry-based pairwise preference formulation can still lead to degenerate policies when challenged by non-deterministic or noisy preference labels, for example human scoring of two candidate outputs with low confidence. This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences with a novel preference likelihood…
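For readers unfamiliar with the Bradley-Terry formulation that the abstract refers to, the following is a minimal sketch of the standard Bradley-Terry preference probability and the standard per-pair DPO loss built on it. It is not the DRDO objective, which the truncated abstract does not spell out. The function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    """
    return sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Standard DPO loss for a single preference pair.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    The loss is the negative log Bradley-Terry likelihood of the labeled
    preference under those implicit rewards.
    """
    implicit_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = implicit_chosen - implicit_rejected
    # -log(sigmoid(margin)) written as a stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Note that this loss treats every labeled pair as a hard preference: even when two responses have nearly equal true rewards (a Bradley-Terry probability close to 0.5), the loss keeps pushing the margin upward. This is one way to see why noisy or low-confidence labels can be problematic for objectives of this form.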
Taxonomy
Topics: Semantic Web and Ontologies · Organizational Management and Leadership
Methods: Direct Preference Optimization
