Why DPO is a Misspecified Estimator and How to Fix It

Aditya Gopalan; Sayak Ray Chowdhury; Debangshu Banerjee

arXiv:2510.20413·cs.LG·October 24, 2025

Open Access · 3 Reviews

TL;DR

This paper shows that Direct Preference Optimization (DPO) becomes a misspecified estimator when the true reward function cannot be realized by the policy class, and introduces AuxDPO, which adds auxiliary variables to the DPO loss to mitigate the resulting failure modes in language model alignment.

Contribution

The paper reveals DPO's misspecification problem and proposes AuxDPO, a novel method with auxiliary variables to improve alignment performance.

Findings

01 · AuxDPO outperforms DPO in didactic bandit settings

02 · AuxDPO achieves better alignment results on LLM tasks

03 · DPO can cause preference reversal and reward degradation

Abstract

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss…
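To make the "supervised learning on preference data" framing concrete, here is a minimal sketch of the standard per-pair DPO loss from the general DPO literature. It is not code from this paper and does not attempt AuxDPO, whose auxiliary-variable term is not specified in the truncated abstract above. The function names, the default `beta`, and the toy log-probabilities are illustrative assumptions. The quantity `beta * (logp - ref_logp)` is the implicit reward that a policy induces, which is the sense in which DPO fits a reward function restricted to those the policy class can express.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dispreferred) response pair.

    logp_w / logp_l: log-probabilities of the preferred / dispreferred
    responses under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference policy.
    beta: KL-regularization strength (0.1 is an assumed default).
    """
    # Implicit rewards induced by the policy, up to a prompt-only constant.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Bradley-Terry negative log-likelihood of the observed preference.
    return -log_sigmoid(reward_w - reward_l)


def mean_dpo_loss(pairs, beta: float = 0.1) -> float:
    """Average the per-pair loss over an iterable of 4-tuples."""
    losses = [dpo_loss(*pair, beta=beta) for pair in pairs]
    return sum(losses) / len(losses)


# Toy values (assumed): the policy equals the reference on the first pair
# and has shifted probability toward the preferred response on the second.
toy_pairs = [(-1.0, -2.0, -1.0, -2.0), (-0.5, -2.5, -1.0, -2.0)]
print(mean_dpo_loss(toy_pairs))
```

When the policy equals the reference, the margin is zero and the per-pair loss is log 2; the loss falls as the policy raises the preferred response's log-probability relative to the dispreferred one. Because the mean is taken over sampled pairs, the minimizer depends on how often each pair appears in the data, which is one way to read the abstract's point about sensitivity to the preference data distribution.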

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper is well written and clearly structured.

Provides an insightful theoretical analysis of DPO and RLHF via Taylor approximation, revealing:
- the local geometry of DPO under parametric policies,
- the local geometry of RLHF optimization, and
- the relationship between RLHF equivalence classes and DPO linearization.

Proposes a novel and principled solution (AuxDPO) to address the identified misspecification issue.

Weaknesses

Could oversampling or undersampling preference pairs to balance frequencies before DPO training mitigate the misspecification issue and yield comparable performance?

There is no discussion or experimental analysis of AuxDPO's sensitivity to its core hyperparameters ($\lambda$ and $n$). How should these values be chosen in practice?

Reviewer 02 · Rating 8 · Confidence 2

Strengths

* The paper is mathematically rigorous in demonstrating its claims.
* The work does a good job demonstrating why the misspecification is a problem through a useful example and follow-up points.
* The results demonstrate strong performance in both in-distribution and out-of-distribution settings.

Weaknesses

* The derivation of the AuxDPO objective makes sense, but it is justified using a local approximation. It could be worth pointing out that the AuxDPO solution (at least to my understanding) holds under these approximations only.
* The paper is a bit hard to follow at times, as it is particularly dense. At various points in the manuscript, more motivation and explanation would be helpful. Why do we want to do a first-order approx? What is the meaning of the A ma

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper proposes an interesting idea to further improve the quality of DPO, based on principled local information geometry. The proposed technique is not only principled but also effective in practice.

Weaknesses

The only downside of the paper is that it is very notation-heavy and not easy to follow on a first read.

Taxonomy

Topics: Reinforcement Learning in Robotics · Recommender Systems and Techniques · Advanced Bandit Algorithms Research