What Matters in Data for DPO?
Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

TL;DR
This paper systematically investigates how the distribution and quality of preference data affect the performance of Direct Preference Optimization (DPO) in aligning large language models, highlighting the importance of chosen responses.
Contribution
It provides a theoretical and empirical analysis of preference data characteristics, emphasizing the dominant role of chosen response quality in DPO effectiveness.
Findings
Quality of chosen responses significantly impacts DPO performance
Contrastiveness between chosen and rejected responses helps mainly by improving the chosen samples
Mixing on-policy data can improve alignment outcomes
Abstract
Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it…
