3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Yuzi Yan; Yibo Miao; Jialian Li; Yipin Zhang; Jian Xie; Zhijie Deng; Dong Yan

arXiv:2406.07327·cs.AI·February 10, 2025

Open Access·3 Reviews

TL;DR

This paper analyzes Direct Preference Optimization (DPO) for aligning large language models, identifying core challenges through the 3D properties, and proposes regularization techniques to improve its stability and effectiveness.

Contribution

It provides a theoretical and empirical analysis of DPO's limitations, introduces the 3D properties, and suggests regularization methods to enhance DPO's performance.

Findings

01

DPO exhibits three key properties, the 3D properties: a drastic drop in the likelihood of rejected responses, degradation into response suppression, and a dispersion effect on unseen responses.

02

These issues arise from DPO's optimization dynamics: the interaction between the gradients of chosen and rejected responses leads to instability.

03

Regularization techniques improve DPO's training stability and performance.
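As a rough illustration of what such regularization can look like, the sketch below combines two ideas the reviews mention: separate weights on the chosen and rejected responses, and an added SFT term on the chosen response. The exact form is an assumption made for illustration, not the paper's formulation; the function name and the parameters `beta_w`, `beta_l`, and `sft_weight` are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                         beta_w=0.1, beta_l=0.05, sft_weight=0.1):
    """Hypothetical regularized DPO loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the reference model.
    beta_w / beta_l: separate weights on the chosen / rejected log-ratios
        (assumption: standard DPO uses a single beta for both).
    sft_weight: weight of the added SFT (negative log-likelihood) term.
    """
    margin = beta_w * (logp_w - ref_logp_w) - beta_l * (logp_l - ref_logp_l)
    preference_term = -log_sigmoid(margin)
    sft_term = -logp_w  # keeps the chosen response likely under the policy
    return preference_term + sft_weight * sft_term


if __name__ == "__main__":
    # With beta_w == beta_l and sft_weight == 0 this reduces to standard DPO.
    print(regularized_dpo_loss(-1.0, -2.0, -1.2, -1.8,
                               beta_w=0.1, beta_l=0.1, sft_weight=0.0))
    # A smaller beta_l damps the push on the rejected response.
    print(regularized_dpo_loss(-1.0, -2.0, -1.2, -1.8))
```

Setting `beta_l` below `beta_w` weakens the gradient on the rejected response, while the SFT term penalizes any drop in the chosen response's likelihood; both choices are one plausible reading of the regularizers described, not a confirmed implementation.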

Abstract

Aligning large language models (LLMs) with human preferences has gained significant attention, with Proximal Policy Optimization (PPO) as a standard yet computationally expensive method and Direct Preference Optimization (DPO) as a more efficient alternative. While DPO offers simplicity, it remains underutilized in state-of-the-art LLMs, suggesting potential limitations. In this work, we revisit DPO, analyzing its theoretical foundations and empirical performance to bridge this gap. We identify three key properties, termed 3D properties, that emerge from DPO's learning process: Drastic drop in rejected response likelihood, Degradation into response suppression, and Dispersion effect on unseen responses. We show that these issues arise from DPO's optimization dynamics, where the interaction between chosen and rejected response gradients leads to instability. Our findings are supported by…
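The interaction between chosen and rejected gradients can be seen in a toy setting. The sketch below is an illustration rather than the paper's code: it treats the likelihoods of one chosen and one rejected response as plain scalars, evaluates the standard DPO loss, and computes its analytic gradients in probability space. The function names and the numeric values are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_margin(pi_w, pi_l, ref_w, ref_l, beta):
    """beta * (log-ratio of the chosen response - log-ratio of the rejected one)."""
    return beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for a single preference pair."""
    return -math.log(sigmoid(dpo_margin(pi_w, pi_l, ref_w, ref_l, beta)))


def dpo_grads(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Analytic gradients of the loss w.r.t. pi_w (chosen) and pi_l (rejected)."""
    scale = beta * sigmoid(-dpo_margin(pi_w, pi_l, ref_w, ref_l, beta))
    grad_w = -scale / pi_w  # negative: gradient descent raises pi_w
    grad_l = scale / pi_l   # positive: gradient descent lowers pi_l
    return grad_w, grad_l


if __name__ == "__main__":
    # Toy values (assumptions): shrink the rejected likelihood, watch the ratio.
    for pi_l in (0.3, 0.1, 0.01):
        g_w, g_l = dpo_grads(pi_w=0.4, pi_l=pi_l, ref_w=0.3, ref_l=0.3)
        print(f"pi_l={pi_l}: |grad_l| / |grad_w| = {abs(g_l) / abs(g_w):.1f}")
```

In this toy setting the ratio of the two gradient magnitudes equals `pi_w / pi_l`, so the push on the rejected response grows relative to the pull on the chosen one as the rejected likelihood shrinks. This is consistent with the drastic drop in rejected response likelihood described in the abstract.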

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

The paper is well-structured, and the toy example supports its claims. The paper offers a balanced mix of theoretical analysis and empirical evidence, which strengthens the claims made about the 3D-properties and their impact on DPO's performance.

Weaknesses

The three observations have been widely studied by previous works. Besides, one of the proposed regularization methods, incorporating an SFT loss into the objective, has been widely used in existing preference learning approaches [1]. This limits the novelty of the paper. Considering that there are many existing methods to solve the DPO problem proposed in this paper, there is a lack of comparison with them, such as [2] and others. Considering the generality of the proposed constraint algorithm,

Reviewer 02·Rating 5·Confidence 1

Strengths

The topic is interesting for RLHF. The paper introduces effective regularization methods, including adaptive gradient weighting for chosen and rejected responses. The experiments are well-conducted and thorough.

Weaknesses

The study could benefit from using a wider range of LLMs. The experiments could use more datasets beyond math. The code is not open source, which may limit reproducibility.

Reviewer 03·Rating 8·Confidence 3

Strengths

- Significance: The paper addresses a crucial and interesting gap by analyzing the limitations of DPO.
- Theoretical Analysis and Empirical Validation: The paper provides a theoretical framework alongside empirical results to validate the presence of the 3D-properties in DPO. This combined approach strengthens the findings, offering clear insights into the mechanisms driving DPO’s limitations and supporting the proposed solutions.

Weaknesses

- Presentation: The presentation could be improved to enhance readability. For example, the text size in Figures 2 and 3 is small, and the description of Scenarios 1-4, which is crucial for understanding the on-policy versus off-policy comparison, is currently only detailed in the appendix. Bringing this description to the main text would improve clarity.
- Experimental Design for On-Policy vs. Off-Policy Comparison: The on-policy and off-policy experiments rely on different data sources, which

Taxonomy

Topics·Manufacturing Process and Optimization

Methods·Direct Preference Optimization