Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
Haocheng Luo, Zehang Deng, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le

TL;DR
This paper introduces logits-SAM, a curvature-regularization method that enhances Direct Preference Optimization (DPO) for large language models by mitigating the squeezing effect, leading to more stable and preference-aligned training.
Contribution
We develop a theoretical framework explaining the squeezing effect in DPO and propose logits-SAM, a computationally efficient variant that improves DPO's effectiveness in aligning language models.
Findings
Logits-SAM consistently improves DPO performance across multiple models and datasets.
Theoretical analysis links the squeezing effect to high-curvature directions in logit space.
Logits-SAM introduces negligible computational overhead while enhancing training stability.
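The squeezing effect named in these findings can be illustrated with a deliberately small toy, which is not the paper's framework: a single softmax over a hypothetical 4-token vocabulary, and one gradient-ascent step on the negative log-likelihood of a rejected token, taken directly in logit space. The logit values, the learning rate, and the helper names `softmax` and `negative_gradient_step` are all assumptions made for illustration. The sketch only shows the direction of the probability shifts, using the standard identity that the gradient of -log p[y] with respect to the logits is p - onehot(y).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_gradient_step(logits, rejected, lr):
    """One gradient-ASCENT step on -log p[rejected], directly on the logits.

    The gradient of -log p[y] w.r.t. the logits is (p - onehot(y)), so
    ascent adds lr * (p - onehot(y)): the rejected logit goes down and
    every other logit goes up in proportion to its current probability.
    """
    probs = softmax(logits)
    return [
        z + lr * (p - (1.0 if i == rejected else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical toy values (assumptions, not taken from the paper):
# index 0 is already dominant, index 1 is the preferred token,
# index 2 is the rejected token, index 3 is an unrelated token.
logits = [4.0, 1.0, 0.5, 0.0]
before = softmax(logits)
after = softmax(negative_gradient_step(logits, rejected=2, lr=2.0))

for name, idx in [("dominant", 0), ("preferred", 1), ("rejected", 2)]:
    print(f"{name:9s} p: {before[idx]:.4f} -> {after[idx]:.4f}")
```

In this toy, pushing down the rejected token moves most of the freed probability mass to the token that was already dominant, and the preferred token loses probability even though it was never penalized directly.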
Abstract
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics in logit space. Our analysis reveals that negative-gradient updates cause residuals to expand rapidly along high-curvature directions, which underlies the squeezing effect, whereas Sharpness-Aware Minimization (SAM) can suppress this behavior through its curvature-regularization effect. Building on this insight, we investigate logits-SAM, a computationally efficient variant that perturbs…
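The abstract is cut off before it states exactly what logits-SAM perturbs, so the following is only a hedged sketch of the general idea, not the authors' implementation. It applies a generic SAM update (ascend to a first-order worst-case point within an L2 ball of radius `rho`, then descend using the gradient computed there) to the weights of a final linear layer only, with the hidden feature `h` held fixed. The loss is a toy softmax cross-entropy rather than DPO, and every name and value (`sam_step_output_layer`, `W`, `h`, `rho`, `lr`) is an assumption for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_from(W, h):
    """Output-layer logits z = W h for a weight matrix W and feature h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def ce_loss(W, h, y):
    return -math.log(softmax(logits_from(W, h))[y])


def ce_grad(W, h, y):
    """Gradient of -log softmax(W h)[y] w.r.t. W: outer(p - onehot(y), h)."""
    p = softmax(logits_from(W, h))
    return [[(p_i - (1.0 if i == y else 0.0)) * x for x in h]
            for i, p_i in enumerate(p)]


def sam_step_output_layer(W, h, y, lr, rho):
    """Generic SAM step applied to the output-layer weights only (assumption)."""
    g = ce_grad(W, h, y)
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    # Ascent: move to the first-order worst-case point within radius rho.
    W_adv = [[w + rho * v / norm for w, v in zip(w_row, g_row)]
             for w_row, g_row in zip(W, g)]
    # Descent: gradient at the perturbed weights, applied to the ORIGINAL weights.
    g_adv = ce_grad(W_adv, h, y)
    return [[w - lr * v for w, v in zip(w_row, g_row)]
            for w_row, g_row in zip(W, g_adv)]


# Hypothetical toy values: a frozen 3-dim feature and a 3-token output layer.
h = [1.0, -0.5, 0.25]
W = [[0.2, -0.1, 0.0], [0.0, 0.3, -0.2], [-0.1, 0.1, 0.4]]
y = 1
W_new = sam_step_output_layer(W, h, y, lr=0.5, rho=0.05)
print(ce_loss(W, h, y), "->", ce_loss(W_new, h, y))
```

In this sketch the second gradient only requires recomputing the logits from the fixed feature `h`, not rerunning anything upstream of the output layer. That is one plausible reading of why an output-layer perturbation would be cheap, but the visible text does not spell out the mechanism.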
Peer Reviews
Decision: ICLR 2026 Poster
This work has the following strengths: 1. It studies an important problem. 2. It proposes leveraging SAM to mitigate the squeezing effect of DPO and provides comprehensive theoretical evidence. 3. Extensive experiments on real-world datasets verify the efficacy of the proposed method.
I also have some concerns: 1. On the description of “gradient descent with a negative learning rate”: The phrase “gradient descent with a negative learning rate” is unconventional, as learning rates in DPO are typically positive. This may cause unnecessary confusion for readers. From the theoretical analysis, I understand that DPO may, in certain cases, apply a reversed gradient direction. However, it would be more precise to present this as theoretically equivalent to using a negative learnin
1. **Clear Theoretical Contribution:** Provides a unified logit-space dynamical analysis that links GD and SAM, pinpointing high-curvature mode amplification as the mechanism behind DPO "squeezing," and proving sign-aligned SAM mitigates it. 2. **Practical, Efficient Method:** Introduces logits-SAM, an output-layer perturbation that retains curvature-aware regularization with minimal overhead (~2–3%) and seamless integration into existing DPO/SLiC-HF/CPO pipelines.
1. **Theory–practice gap:** Core analysis relies on first/second-order approximations in logit space (fixed features, softmax CE), which may not fully capture nonlinearity and parameter coupling in deep, decoder-only LMs. 2. **Final-layer perturbation bias:** Restricting SAM to the output layer improves efficiency but may miss sharp directions arising in earlier blocks/attention layers, potentially undercutting robustness on harder distributions. 3. **DPO-specific framing:** The mitigation is
- The authors present a rigorous theoretical diagnosis of a known DPO failure mode. Linking the "squeezing effect" to "high-curvature directions" in logit space is specific, actionable, and clear. The theoretical contributions and discussions are also clearly written and presented as an elegant explanation for an existing harmful phenomenon. The theory-to-practice pipeline is also well motivated. Section 3.2 predicts how SAM should behave, Fig. 1(a, b) confirm this prediction on a
- This method introduces a new highly sensitive hyperparameter ($\rho$). Table 3 demonstrates that while $\rho=10^{-4}$ might yield good results, the slightly larger $\rho=10^{-3}$ actually performs worse than the baseline. The optimal range seems extremely small. The majority of preference fine-tuning methods already encompass highly sensitive hyperparameters, and discussions about reference-based versus reference-free preference fine-tuning, in order to shift towards more robust solutions that
Taxonomy
Topics: Machine Learning and Data Classification · Constraint Satisfaction and Optimization · Advanced Multi-Objective Optimization Algorithms
