Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence
Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun

TL;DR
This paper identifies and addresses the length bias in Direct Preference Optimization (DPO) for aligning large language models, proposing a downsampling method called SamPO that reduces verbosity and improves reward accuracy across various benchmarks.
Contribution
The paper reveals the length reliance issue in DPO and introduces SamPO, a novel downsampling technique that mitigates verbosity and enhances alignment performance.
Findings
SamPO effectively reduces verbosity in DPO.
Experimental results show 5-12% improvements over DPO.
Bias in reward estimation is linked to sequence length discrepancies.
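The link between reward bias and sequence length can be made explicit with the standard DPO formulation. The block below is a sketch in our own notation, not copied from the paper: $y_w$ and $y_l$ are the chosen and rejected responses, $T$ is the token count of a response, and the implicit reward is written up to a prompt-only term that cancels in the difference.

```latex
\begin{aligned}
r_\theta(x, y) &= \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
               = \beta \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}, \\
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma\bigl( r_\theta(x, y_w) - r_\theta(x, y_l) \bigr).
\end{aligned}
```

Because each reward is a sum over $T$ token-level terms, the margin inside $\sigma$ compares a sum of $T_w$ terms with a sum of $T_l$ terms. When $T_w \neq T_l$, the longer response contributes more terms, so the margin can shift with length alone even if the per-token log-ratios are similar.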
Abstract
Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: "verbosity", a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between sequence-level Kullback-Leibler (KL) divergences between chosen and rejected sequences, used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate…
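The length effect described above can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' implementation: it assumes that down-sampling means uniformly sampling the same number of token-level log-ratios from both responses, with that number set to the shorter length. The function names, the value of `beta`, and all numbers are hypothetical.

```python
import math
import random


def dpo_loss(chosen_sum, rejected_sum, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (chosen_sum - rejected_sum)
    # -log(sigmoid(m)) is equal to log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


def downsample_pair(chosen_ratios, rejected_ratios, rng):
    """Keep an equal number of token-level log-ratios from each response.

    Assumption: tokens are drawn uniformly without replacement, and the
    kept count is the length of the shorter response.
    """
    k = min(len(chosen_ratios), len(rejected_ratios))
    chosen_kept = rng.sample(chosen_ratios, k)
    rejected_kept = rng.sample(rejected_ratios, k)
    return sum(chosen_kept), sum(rejected_kept)


# Toy pair with made-up numbers: every token has the same log-ratio
# log(pi_theta / pi_ref) = 0.1, but the rejected response is 4x longer.
rng = random.Random(0)
chosen_ratios = [0.1] * 10
rejected_ratios = [0.1] * 40

# Full sequence-level sums: the rejected response looks "better" purely
# because it has more tokens.
full_loss = dpo_loss(sum(chosen_ratios), sum(rejected_ratios))

# Down-sampled sums: both sides now contribute the same number of tokens.
chosen_sum, rejected_sum = downsample_pair(chosen_ratios, rejected_ratios, rng)
sampled_loss = dpo_loss(chosen_sum, rejected_sum)

print("full-sequence loss:", full_loss)
print("down-sampled loss: ", sampled_loss)
```

In this toy pair the full-sequence margin is negative only because the rejected response has more tokens, whereas the down-sampled margin is zero, which matches the identical per-token quality of the two responses.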
Taxonomy
Topics: Neural Networks and Applications · Blind Source Separation Techniques · Face and Expression Recognition
Methods: Direct Preference Optimization
