Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

TL;DR
This paper introduces Direct Preference Optimization (DPO), a simple, stable, and effective method for aligning language models with human preferences without complex reinforcement learning procedures.
Contribution
DPO provides a new parameterization of the reward model enabling closed-form optimal policy extraction, simplifying the fine-tuning process of language models based on human preferences.
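To make the reparameterization concrete, here is a minimal sketch of the per-example DPO loss as it is usually stated: the implicit reward of a response is beta times the log-ratio between the policy and a frozen reference model, and the loss is a logistic loss on the reward margin between the preferred and dispreferred response. The function and argument names, and the default `beta=0.1`, are illustrative assumptions, not taken from this page; the inputs are assumed to be summed token log-probabilities of each full response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how strongly the policy is tied to the reference
) -> float:
    """Per-example DPO loss (illustrative sketch, hypothetical names)."""
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)) for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic (Bradley-Terry style) loss on the reward margin.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model, both implicit rewards are zero and the loss is log 2; it falls below that value as the policy shifts probability toward the preferred response relative to the reference. No separate reward model or sampling loop appears in the computation, which is the simplification the contribution describes.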
Findings
DPO matches or exceeds RLHF in aligning models with human preferences.
DPO exceeds PPO-based RLHF at controlling the sentiment of generated text.
DPO is simpler and more computationally efficient than existing methods.
Abstract
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed…
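The two-stage RLHF pipeline described in the abstract is commonly written as follows. This is a sketch in standard notation, which is an assumption here since the page itself gives no formulas: $r_\phi$ is the learned reward model, $\pi_\theta$ the policy being fine-tuned, $\pi_{\mathrm{ref}}$ the original model, $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses, and $\beta$ the strength of the penalty for drifting from $\pi_{\mathrm{ref}}$.

```latex
% Stage 1: fit a reward model to pairwise human preferences.
\mathcal{L}_R(r_\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% Stage 2: maximize the estimated reward while staying close to the original model.
\max_{\pi_\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y\mid x) \,\|\, \pi_{\mathrm{ref}}(y\mid x) \right]
```

Stage 2 is where reinforcement learning normally enters, since the expectation is taken over samples from the policy being trained.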
Code & Models
- 🤗 aaditya/Llama3-OpenBioLLM-8B · model · 39k dl · ♡ 236
- 🤗 aaditya/Llama3-OpenBioLLM-70B · model · 3.6k dl · ♡ 503
- 🤗 LiteLLMs/Llama3-OpenBioLLM-8B-GGUF · model · 34 dl · ♡ 1
- 🤗 ChiKoi7/stablelm-zephyr-3b-Heretic-GGUF · model · 170 dl · ♡ 2
- 🤗 Gxl/sda · model
- 🤗 lomahony/eleuther-pythia70m-hh-dpo · model · 17 dl
- 🤗 lomahony/eleuther-pythia160m-hh-dpo · model · 16 dl
- 🤗 lomahony/eleuther-pythia410m-hh-dpo · model · 12 dl
- 🤗 Leogrin/eleuther-pythia1b-hh-dpo · model · 4 dl · ♡ 1
- 🤗 Leogrin/eleuther-pythia1.4b-hh-dpo · model · 18 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Direct Preference Optimization · ALIGN
