Length Desensitization in Direct Preference Optimization

Wei Liu; Yang Bai; Chengcheng Han; Rongxiang Weng; Jun Xu; Xuezhi Cao; Jingang Wang; Xunliang Cai

arXiv:2409.06411·cs.LG·December 2, 2024

Open Access · 3 Reviews

TL;DR

This paper identifies length sensitivity issues in Direct Preference Optimization (DPO) used in RLHF for LLMs, and proposes LD-DPO to improve response conciseness and alignment with human preferences.

Contribution

The paper provides a theoretical analysis of DPO's length bias and introduces LD-DPO, a novel method to reduce length sensitivity and improve response quality.

Findings

1. LD-DPO reduces response length by 10-40% compared to DPO.
2. LD-DPO outperforms baseline methods on multiple benchmarks.
3. Experimental results confirm length desensitization and better alignment with human preferences.

Abstract

Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO's optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during the DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is…
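The link between implicit reward and length that the abstract describes can be illustrated with a minimal sketch. It assumes the standard DPO formulation, in which the implicit reward is a scaled log-ratio between policy and reference sequence likelihoods, and each sequence log-likelihood is a sum of per-token log-probabilities. The function names, the `beta` value, and the toy numbers are illustrative assumptions, not taken from the paper, and this simplification is not the paper's own analysis.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into a sequence log-likelihood."""
    return sum(token_logprobs)


def implicit_reward(policy_token_lps, ref_token_lps, beta=0.1):
    """Standard DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (sequence_logprob(policy_token_lps) - sequence_logprob(ref_token_lps))


def dpo_loss(r_chosen, r_rejected):
    """Per-pair DPO loss, -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))


# Toy data (hypothetical): every token has the same policy/reference log-ratio of 0.2,
# so the two responses differ only in length.
short_reward = implicit_reward([-1.0] * 5, [-1.2] * 5)
long_reward = implicit_reward([-1.0] * 20, [-1.2] * 20)
print(short_reward, long_reward)  # the 20-token reward is about 4x the 5-token one
```

With an identical per-token log-ratio, the reward magnitude grows linearly with the number of tokens, which is one simple way to see how length can leak into the preference margin.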

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The logic flow is clear.
- The authors identify the reason for length sensitivity in the DPO algorithm.
- Based on their analysis, the authors propose the LD-DPO algorithm, which performs well in terms of length control and alignment.
- Experiments with three models across two datasets demonstrate the generalizability of LD-DPO.
- LD-DPO is a simple yet effective method.

Weaknesses

- Although the authors claim to have theoretically proven the sensitivity of DPO to length, the description is still insufficiently rigorous. For example, from Equation 4 to Equation 5, the expectation sign is omitted without further explanation.
- The explanation from lines 211 to 215 is vague and overly intuitive, especially regarding the relationship between length and probability.
- In Equation 7, the authors take the absolute value of the ratio of two Jacobians, a less clear motivation that

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. Addresses a popular issue of DPO's sensitivity to length.
2. Good presentation and easy to read.
3. Good empirical performance.

Weaknesses

1. Although theoretical insights into why DPO favors longer responses are provided, the proposed LD-DPO is a heuristic method. It directly cuts off the importance of the tokens exceeding the public length. It is disappointing to see that the solution to the well-formulated length sensitivity problem is just a code-level heuristic method. Why not try to modify the DPO loss for a loss landscape [1] that is length-desensitized?
2. The description for eq. (10) is not rigorous. Why $p^\alpha$ is "human-like pre
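The down-weighting that Reviewer 02 describes can be sketched as follows. This is a hedged reading of the review, not the authors' implementation: it assumes the "public length" is the length of the shorter response in a pair, and that raising a token probability to the power $\alpha$ corresponds to multiplying its log-probability by $\alpha$. The names `public_length` and `ld_sequence_logprob` are hypothetical.

```python
def public_length(chosen_tokens, rejected_tokens):
    """Assumed definition: the number of positions both responses share."""
    return min(len(chosen_tokens), len(rejected_tokens))


def ld_sequence_logprob(token_logprobs, public_len, alpha):
    """Length-desensitized log-likelihood under the assumptions above.

    Tokens up to public_len keep full weight; tokens beyond it are scaled by
    alpha, which equals using p**alpha in probability space.
    """
    head = sum(token_logprobs[:public_len])
    tail = sum(token_logprobs[public_len:])
    return head + alpha * tail


# Toy per-token log-probabilities for a 4-token response paired with a 2-token one.
lps = [-1.0, -2.0, -3.0, -4.0]
full = ld_sequence_logprob(lps, public_len=2, alpha=1.0)       # plain sum of all tokens
truncated = ld_sequence_logprob(lps, public_len=2, alpha=0.0)  # excess tokens ignored
partial = ld_sequence_logprob(lps, public_len=2, alpha=0.5)    # excess tokens half-weighted
```

Under this reading, `alpha=1.0` recovers the ordinary sequence log-likelihood, `alpha=0.0` drops everything past the public length, and intermediate values interpolate between the two.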

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. Length bias exists in a wide range of LLM alignment methods and should be disentangled from real human preferences.
2. The motivation of LD-DPO is clearly expressed by the theoretical analysis.
3. The proposed method is evaluated on multiple benchmarks and base models.

Weaknesses

1. Redundant symbol definitions. I do not think the definitions of $\mathcal{X}_1$, $\mathcal{X}_2$, $\mathcal{K}_1$, $\mathcal{K}_2$ are necessary. They just make the paper harder to understand.
2. The colors in Fig. 3 are difficult to distinguish, and the figure is also a bit hard to comprehend.
3. Some spelling and grammatical mistakes, e.g. "Length **Desentsitization** of DPO, termed LD-DPO"

Taxonomy

Topics: Advanced Algebra and Logic · Multi-Criteria Decision Making

Methods: Direct Preference Optimization · ALIGN