Why DPO is a Misspecified Estimator and How to Fix It
Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee

TL;DR
This paper shows that Direct Preference Optimization (DPO) is a misspecified estimator whenever the true reward function cannot be realized by the policy class, and introduces AuxDPO, which adds auxiliary variables to the DPO loss to mitigate the resulting failure modes in language model alignment.
Contribution
The paper reveals DPO's misspecification problem and proposes AuxDPO, a novel method with auxiliary variables to improve alignment performance.
Findings
AuxDPO outperforms DPO in didactic bandit settings
AuxDPO achieves better alignment results on LLM tasks
DPO can cause preference reversal and reward degradation
Abstract
Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss…
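For readers unfamiliar with the objective under discussion, the following is a minimal sketch of the standard DPO loss as it is usually stated in the DPO literature. It is not the AuxDPO loss, whose exact form is not given in this summary. The function name `dpo_loss`, the default `beta=0.1`, and the choice to pass precomputed sequence log-probabilities as plain lists are assumptions made for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Average standard DPO loss over preference pairs.

    logp_w / logp_l: policy log-probabilities of the preferred (w) and
    dispreferred (l) responses; ref_logp_*: the same under the reference
    policy. All names and the beta default are illustrative assumptions.
    """
    total = 0.0
    for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l):
        # Implicit reward margin: beta times the difference of the
        # log-ratios to the reference policy.
        margin = beta * ((pw - rw) - (pl - rl))
        # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(logp_w)
```

The loss depends on the policy only through log-ratios to the reference, which is why the set of reward functions it can represent is tied to the parametric policy class, the starting point of the misspecification argument in the abstract.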
Peer Reviews
Decision: ICLR 2026 Oral
The paper is well written and clearly structured. Provides an insightful theoretical analysis of DPO and RLHF via Taylor approximation, revealing:
- the local geometry of DPO under parametric policies,
- the local geometry of RLHF optimization, and
- the relationship between RLHF equivalence classes and DPO linearization.
Proposes a novel and principled solution (AuxDPO) to address the identified misspecification issue.
Could oversampling or undersampling preference pairs to balance frequencies before DPO training mitigate the misspecification issue and yield comparable performance? There is no discussion or experimental analysis of AuxDPO's sensitivity to its core hyperparameters ($\lambda$ and $n$). How should these values be chosen in practice?
* The paper is mathematically rigorous in demonstrating its claims.
* The work does a good job demonstrating why the misspecification is a problem through a useful example and follow-up points.
* The results demonstrate strong performance in both in-distribution and out-of-distribution settings.
* The derivation of the AuxDPO objective makes sense, but it is justified using a local approximation. While this makes sense, it could be worth pointing out that the AuxDPO solution (at least to my understanding) holds under these approximations only.
* The paper is a bit hard to follow at times, as it is particularly dense. At various points in the manuscript, more motivation and explanation would be helpful. Why do we want to do a first-order approximation? What is the meaning of the A ma
This paper proposes an interesting idea to further improve the quality of DPO, based on principled local information geometry. The proposed technique is not only principled but also effective in practice.
The only downside of the paper is that it is very notation-heavy and not easy to follow on a first read.
Taxonomy
Topics: Reinforcement Learning in Robotics · Recommender Systems and Techniques · Advanced Bandit Algorithms Research
