DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li

TL;DR
This paper introduces DPO-Shift, a method to control the distribution of chosen response probabilities in preference optimization, addressing likelihood displacement and improving alignment with human preferences.
Contribution
DPO-Shift provides a simple, theoretically grounded approach to mitigate likelihood displacement in preference optimization, with demonstrated improvements on downstream tasks.
Findings
DPO-Shift effectively shifts the chosen probability distribution.
There is a fundamental trade-off between chosen probability and reward margin.
DPO-Shift outperforms standard DPO on downstream benchmarks.
Abstract
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win…
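To make likelihood displacement concrete, here is a minimal sketch of the standard per-example DPO loss computed from sequence log-probabilities. The exact DPO-Shift objective is not reproduced on this page, so the `shift` argument is an assumption: a hypothetical scaling factor on the rejected reward term, loosely suggested by checkpoint names such as "fixed-0.95". With `shift=1.0` the function reduces to standard DPO. All numeric values are invented for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
             beta=0.1, shift=1.0):
    """Return (loss, reward_margin) for one preference pair.

    Inputs are sequence log-probabilities under the policy and the
    frozen reference model. shift=1.0 gives standard DPO; shift < 1.0
    is an ASSUMED, hypothetical variant that down-weights the rejected
    reward. It is not taken from this page.
    """
    chosen_reward = beta * (pol_chosen - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    loss = -log_sigmoid(chosen_reward - shift * rejected_reward)
    return loss, margin


# Illustrative log-probabilities (made up for this example).
ref_chosen, ref_rejected = -10.0, -12.0

# Before training the policy equals the reference, so the loss is log(2).
loss_init, margin_init = dpo_loss(ref_chosen, ref_rejected,
                                  ref_chosen, ref_rejected)

# After training: the chosen log-probability FELL from -10 to -11, but the
# rejected one fell further, so the margin grew and the DPO loss dropped.
pol_chosen, pol_rejected = -11.0, -16.0
loss_dpo, margin_dpo = dpo_loss(pol_chosen, pol_rejected,
                                ref_chosen, ref_rejected)

# Same policy under the hypothetical shifted objective.
loss_shift, _ = dpo_loss(pol_chosen, pol_rejected,
                         ref_chosen, ref_rejected, shift=0.95)

print(f"initial loss {loss_init:.4f}, margin {margin_init:.2f}")
print(f"DPO loss     {loss_dpo:.4f}, margin {margin_dpo:.2f}")
print(f"shifted loss {loss_shift:.4f}")
```

In this toy case the standard loss rewards the policy even though the chosen response became less likely, which is the behavior the abstract calls likelihood displacement. Under the assumed shifted objective, the same policy earns less credit for pushing the rejected log-probability down.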
Code & Models
- 🤗 NoManDeRY/DPO-Shift-Qwen-2-7B-Ultrafeedback-fixed-1.0 · model · 3 downloads
- 🤗 NoManDeRY/DPO-Shift-Qwen-2-7B-UltraChat200K-SFT · model · 2 downloads
- 🤗 NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-fixed-0.95 · model · 4 downloads
- 🤗 NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-fixed-1.0 · model · 4 downloads
- 🤗 NoManDeRY/DPO-Shift-Qwen-2-7B-Ultrafeedback-fixed-0.95 · model · 1 download
- 🤗 NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-decrease_linear-1.0to0.95 · model · 6 downloads
- 🤗 NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-increase_linear_0.95to1.0 · model · 5 downloads
Taxonomy
Topics: Constraint Satisfaction and Optimization
Methods: Direct Preference Optimization
