Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

Junru Lu; Jiazheng Li; Siyu An; Meng Zhao; Yulan He; Di Yin; Xing Sun

arXiv:2406.10957·cs.CL·December 10, 2024

TL;DR

This paper identifies and addresses the length bias in Direct Preference Optimization (DPO) for aligning large language models, proposing a downsampling method called SamPO that reduces verbosity and improves reward accuracy across various benchmarks.

Contribution

The paper reveals the length reliance issue in DPO and introduces SamPO, a novel downsampling technique that mitigates verbosity and enhances alignment performance.
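The down-sampling idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "down-sampling" means keeping the same number of token-level log-ratio terms from the chosen and the rejected response, drawn at random, so that neither side gains reward simply by having more tokens. The function names, the sampling rule, and the toy numbers are all hypothetical.

```python
import random


def downsample(token_log_ratios, k, rng):
    """Randomly keep k token-level log-ratios, without replacement."""
    return rng.sample(token_log_ratios, k)


def length_matched_margin(chosen_tok, rejected_tok, rng):
    """Reward margin computed over an equal number of tokens per response.

    Assumption: both responses are cut down to the shorter length.
    """
    k = min(len(chosen_tok), len(rejected_tok))
    chosen_sum = sum(downsample(chosen_tok, k, rng))
    rejected_sum = sum(downsample(rejected_tok, k, rng))
    return chosen_sum - rejected_sum


# Toy data: every token carries the same log-ratio, only the lengths differ.
chosen_tok = [0.05] * 40    # long chosen response
rejected_tok = [0.05] * 10  # short rejected response

rng = random.Random(0)
full_margin = sum(chosen_tok) - sum(rejected_tok)
matched_margin = length_matched_margin(chosen_tok, rejected_tok, rng)
```

In this toy case the full-sequence margin is positive purely because the chosen response is longer, while the length-matched margin vanishes, since per-token quality is identical by construction.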

Findings

01. SamPO effectively reduces verbosity in DPO.

02. Experimental results show 5-12% improvements over DPO.

03. Bias in reward estimation is linked to sequence length discrepancies.

Abstract

Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: "verbosity", a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between sequence-level Kullback-Leibler (KL) divergences between chosen and rejected sequences, used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate…
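The length effect described in the abstract can be made concrete with a toy calculation. This is a minimal sketch, assuming the standard DPO formulation in which each response's implicit reward is the sum of per-token log-ratios between the policy and the reference model, and the loss is the negative log-sigmoid of the scaled reward margin. The helper names, the value of `beta`, and the log-probabilities are hypothetical.

```python
import math


def sequence_log_ratio(policy_logps, ref_logps):
    """Sum of per-token log(pi_theta / pi_ref) over one response."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def dpo_loss(chosen_ratio, rejected_ratio, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen - rejected))."""
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical setup: every token has the same log-ratio of 0.05,
# so the two responses differ only in length (40 vs. 10 tokens).
chosen = sequence_log_ratio([-1.0] * 40, [-1.05] * 40)
rejected = sequence_log_ratio([-1.0] * 10, [-1.05] * 10)

loss_with_length_gap = dpo_loss(chosen, rejected)
loss_equal_length = dpo_loss(rejected, rejected)  # margin of zero
```

Because the sum runs over all tokens, the 40-token response accumulates four times the reward of the 10-token one even though each token contributes identically, and the loss falls below the zero-margin value of log 2.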

Code & Models

Repositories

lujunru/sampo (PyTorch, official)

Videos

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence· underline

Taxonomy

Topics: Neural Networks and Applications · Blind Source Separation Techniques · Face and Expression Recognition

Methods: Direct Preference Optimization