Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Abhijnan Nath; Changsoo Jung; Ethan Seefried; Nikhil Krishnaswamy

arXiv:2410.08458·cs.LG·February 3, 2025

Open Access

TL;DR

This paper introduces DRDO, a method that simultaneously distills rewards and learns preferences for language models, improving robustness and performance over existing methods like DPO, especially under noisy or out-of-distribution (OOD) conditions.

Contribution

DRDO is the first approach to jointly model rewards and preferences, addressing degeneracy issues and enhancing robustness in language model alignment.

Findings

01. DRDO outperforms DPO and e-DPO in expected rewards.

02. DRDO is more robust to noisy preference signals.

03. DRDO maintains performance in out-of-distribution scenarios.

Abstract

Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular for their simplicity, DPO and similar direct alignment methods, which rely heavily on the Bradley-Terry pairwise preference formulation, can still produce degenerate policies when challenged by non-deterministic or noisy preference labels, for example when humans score two candidate outputs with low confidence. This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences with a novel preference likelihood…
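To make the Bradley-Terry formulation mentioned in the abstract concrete, the following is a minimal sketch of the standard pairwise preference probability and the standard DPO loss. It is not DRDO's objective, which the truncated abstract does not specify. The function names, the `beta` default, and the `soft_label_loss` helper are assumptions introduced here purely for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    # sigmoid(r_w - r_l), written as exp(-softplus(-(r_w - r_l))) for stability
    return math.exp(-softplus(-(reward_chosen - reward_rejected)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return softplus(-margin)


def soft_label_loss(margin: float, p_chosen: float) -> float:
    """Hypothetical helper: cross-entropy against a soft preference label.

    p_chosen = 1.0 recovers the hard-label loss -log sigmoid(margin);
    p_chosen = 0.5 models a rater with no confidence either way.
    """
    return p_chosen * softplus(-margin) + (1.0 - p_chosen) * softplus(margin)
```

With a hard label (`p_chosen = 1.0`) the loss keeps decreasing as the margin grows, so nothing in the objective stops the policy from pushing the margin without bound. With `p_chosen = 0.5` the loss is minimized at a margin of zero. This gap between hard labels and uncertain raters is one way to picture the degeneracy under noisy preferences that the abstract describes.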

Taxonomy

Topics: Semantic Web and Ontologies · Organizational Management and Leadership

Methods: Direct Preference Optimization