Mitigating Mismatch within Reference-based Preference Optimization

Suqin Yuan; Xingrui Yu; Jiyang Zheng; Lei Feng; Dadong Wang; Ivor Tsang; Tongliang Liu

arXiv:2602.11902·cs.LG·February 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces HyPO, a modification to DPO that conditionally debiases the reference signal to improve preference alignment of large language models, especially on pessimistic pairs.

Contribution

HyPO offers a simple, effective change to DPO that mitigates premature satisfaction and enhances preference alignment without additional computational cost.

Findings

1. HyPO improves inference-aligned metrics.
2. HyPO achieves higher pairwise win rates.
3. HyPO effectively mitigates training-inference mismatch.

Abstract

Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($\Delta_{\theta}$) merely beats the reference margin ($\Delta_{ref}$), even if the policy is still wrong ($\Delta_{\theta} < 0$). We name this failure premature satisfaction, which is a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing…
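The premature-satisfaction effect described in the abstract can be illustrated numerically. The sketch below is not the authors' code: it assumes the standard DPO loss $-\log\sigma(\beta(\Delta_{\theta} - \Delta_{ref}))$ and, for the fix, the $\max(0, \Delta_{ref})$ clipping quoted in the peer reviews on this page. The function names, `beta = 1.0`, and the example margins are hypothetical choices made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_gradient_weight(delta_theta: float, delta_ref: float, beta: float = 1.0) -> float:
    # Standard DPO loss: -log sigmoid(beta * (delta_theta - delta_ref)).
    # The magnitude of its derivative w.r.t. delta_theta is
    # beta * sigmoid(-beta * (delta_theta - delta_ref)).
    return beta * sigmoid(-beta * (delta_theta - delta_ref))


def clipped_gradient_weight(delta_theta: float, delta_ref: float, beta: float = 1.0) -> float:
    # Assumed fix (as described by a reviewer): replace delta_ref with max(0, delta_ref),
    # so a pessimistic reference (delta_ref < 0) no longer lowers the bar.
    return beta * sigmoid(-beta * (delta_theta - max(0.0, delta_ref)))


# Pessimistic pair: the reference prefers the rejected response (delta_ref < 0).
# The policy beats the reference margin but is still wrong (delta_theta < 0).
delta_ref, delta_theta = -2.0, -0.5

w_dpo = dpo_gradient_weight(delta_theta, delta_ref)
w_clip = clipped_gradient_weight(delta_theta, delta_ref)
print(f"DPO weight:     {w_dpo:.4f}")
print(f"clipped weight: {w_clip:.4f}")
```

Under these assumptions the DPO weight is already small even though the policy still ranks the pair incorrectly, while the clipped variant keeps a larger weight until $\Delta_{\theta}$ becomes positive. For optimistic pairs ($\Delta_{ref} \ge 0$) the two weights coincide.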

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. This work is well-motivated. The paper identifies a concrete limitation of DPO and proposes a simple yet effective idea to fix it.
2. Strong numerical behavior. The experiments on Mistral and Llama are reported with significant improvements over existing baseline methods (e.g., DPO and SimPO).

Weaknesses

1. The Qwen series models are missing. It would be better to also include one of them to run the tests.
2. The sample code is not provided. It would be better to share more details on experimental configurations.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper is well-organized, moving naturally from the identification of DPO’s weakness to the proposal of a concise and theoretically grounded fix, with clear motivation.
2. HyPO establishes a thoughtful balance between reference-based and reference-free optimization, combining their respective advantages while eliminating their key drawbacks.
3. The authors provide strong evidence that pessimistic samples are a real and significant issue, and this observation directly supports their desig

Weaknesses

1. The paper uses sequence-level log-likelihood difference to decide “pessimistic”, but explicitly “does not length-normalize”. Although evaluation uses length-controlled metrics (e.g., AlpacaEval LC), the training-time pessimism criterion remains length-sensitive.
2. The paper fails to distinguish cases where a negative reference margin reflects genuine “pessimism” from cases where it arises due to noisy or incorrect preference labels, in which the reference may in fact be the more reliable an
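The length-sensitivity concern in the first weakness can be made concrete with a small sketch. It assumes the pessimism test is the sign of the summed (sequence-level) reference log-likelihood difference, as the reviewer describes; the per-token log-probabilities and response lengths below are invented for illustration, and the function names are hypothetical.

```python
def sequence_margin(chosen_logps: list[float], rejected_logps: list[float]) -> float:
    # Sequence-level margin: sum of token log-probs, no length normalization.
    return sum(chosen_logps) - sum(rejected_logps)


def length_normalized_margin(chosen_logps: list[float], rejected_logps: list[float]) -> float:
    # Per-token average log-prob difference (one possible normalization).
    return sum(chosen_logps) / len(chosen_logps) - sum(rejected_logps) / len(rejected_logps)


# Hypothetical reference-model log-probs: the chosen response is longer,
# but each of its tokens is slightly more likely than the rejected one's.
chosen = [-1.0] * 50
rejected = [-1.2] * 20

raw = sequence_margin(chosen, rejected)
normalized = length_normalized_margin(chosen, rejected)

print("sequence-level margin < 0 (pessimistic):", raw < 0)
print("length-normalized margin < 0 (pessimistic):", normalized < 0)
```

In this constructed case the pair is labeled pessimistic only because the chosen response is longer; the per-token comparison points the other way.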

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Well-Motivated Solution: The paper clearly identifies a potential issue with the reference model in DPO optimization and provides a simple yet effective solution ($\max(0, \Delta_{ref})$) to address it.
2. Stable Empirical Improvement: The proposed method is empirically validated on different models and benchmarks to improve over several existing DPO variants. The downstream task evaluation further demonstrates that the proposed method avoids performance degradation.

Weaknesses

1. **Introduced Hyperparameter**: In practice, the objective is implemented with $\Delta_\theta - \max(\Delta_{ref}, \gamma) - h$, which introduces two additional hyperparameters, $\gamma$ and $h$, to DPO. This complicates the hyperparameter tuning process in practice.
2. **Concerns about Ablation Results**: The motivation of the proposed HyPO method is based on the "premature satisfaction" problem, which the authors solve by applying the ReLU function to the reference margin $\max(\Delta_{ref}, \
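The expression the reviewer quotes, $\Delta_\theta - \max(\Delta_{ref}, \gamma) - h$, can be sketched as follows to show how $\gamma$ and $h$ act. Only that inner expression comes from the review; wrapping it in $-\log\sigma(\beta \cdot)$, the `beta` parameter, the default values, and the function names are assumptions carried over from the usual DPO form, not confirmed details of the paper.

```python
import math


def thresholded_logit(delta_theta: float, delta_ref: float,
                      gamma: float = 0.0, h: float = 0.0, beta: float = 1.0) -> float:
    # Inner expression quoted in the review: delta_theta - max(delta_ref, gamma) - h.
    # gamma is the floor applied to the reference margin; h is an extra offset.
    # Scaling by beta is an assumption borrowed from DPO.
    return beta * (delta_theta - max(delta_ref, gamma) - h)


def thresholded_loss(delta_theta: float, delta_ref: float,
                     gamma: float = 0.0, h: float = 0.0, beta: float = 1.0) -> float:
    # Assumed wrapper: -log sigmoid(z) written as log(1 + exp(-z)).
    z = thresholded_logit(delta_theta, delta_ref, gamma, h, beta)
    return math.log1p(math.exp(-z))


# With gamma = 0 and h = 0 this reduces to the max(0, delta_ref) form.
pessimistic = thresholded_logit(delta_theta=-0.5, delta_ref=-2.0)   # reference floored at 0
optimistic = thresholded_logit(delta_theta=4.0, delta_ref=3.0)      # reference used as-is
shifted = thresholded_logit(delta_theta=4.0, delta_ref=3.0, h=0.5)  # offset tightens the target

print(pessimistic, optimistic, shifted)
```

Setting `gamma` to a very large negative value would make the floor inactive and, with `h = 0`, recover the plain DPO logit under the same assumptions.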

Taxonomy

Topics: Constraint Satisfaction and Optimization · Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI)