Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov; Archit Sharma; Eric Mitchell; Stefano Ermon; Christopher D. Manning; Chelsea Finn

arXiv:2305.18290·cs.LG·July 31, 2024·271 cites

TL;DR

This paper introduces Direct Preference Optimization (DPO), a simple, stable, and effective method for aligning language models with human preferences without complex reinforcement learning procedures.

Contribution

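DPO reparameterizes the RLHF reward model so that the corresponding optimal policy can be extracted in closed form. This reduces preference fine-tuning to a simple classification loss on the language model itself, with no separately trained reward model and no reinforcement learning loop.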
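As a rough illustration of that classification loss, here is a minimal per-example sketch in plain Python. It assumes the summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the policy being trained and a frozen reference model. The function and argument names, and the default `beta=0.1`, are illustrative choices, not taken from this page.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """Reward implied by the policy: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how strongly the policy is tied to the reference
) -> float:
    """Binary classification loss on a single preference pair."""
    # Difference between the implicit rewards of the chosen and rejected responses.
    margin = implicit_reward(policy_logp_chosen, ref_logp_chosen, beta) - implicit_reward(
        policy_logp_rejected, ref_logp_rejected, beta
    )
    # Negative log-likelihood that the chosen response is preferred.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero, so the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: the loss drops.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In practice the log-probabilities would come from a forward pass of each model over a batch, and the loss would be averaged across the batch before backpropagating through the policy only.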

Findings

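01 DPO fine-tunes LMs to align with human preferences as well as or better than existing RLHF methods, matching or improving response quality in summarization and single-turn dialogue.

02 DPO exceeds PPO-based RLHF at controlling the sentiment of generated text.

03 DPO is simpler to implement and computationally lighter than RLHF: it needs no sampling from the LM during fine-tuning and no significant hyperparameter tuning.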

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed…
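The two-stage RLHF pipeline described in the abstract is commonly written as follows. This is a sketch in standard notation, and the symbols are assumptions that do not appear on this page: $r_\phi$ is the learned reward model, $\pi_\theta$ the policy being fine-tuned, $\pi_{\mathrm{ref}}$ the original model, $\beta$ the strength of the constraint, and $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses.

```latex
% Stage 1: fit a reward model to pairwise human preferences
\mathcal{L}_R(r_\phi) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% Stage 2: maximize the estimated reward without drifting far from the original model
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

The KL term in the second stage is what keeps the fine-tuned model close to the original one.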

Code & Models

Videos

Direct Preference Optimization: Your Language Model is Secretly a Reward Model | DPO paper explained · YouTube

Direct Preference Optimization: Your Language Model is Secretly a Reward Model · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Direct Preference Optimization · ALIGN