Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
Inoussa Mouiche

TL;DR
This paper introduces Gate-DPO, a method that stabilizes preference optimization in language models by modulating gradients based on probability geometry, reducing training collapse and improving response quality.
Contribution
Gate-DPO is a novel gradient-modulation technique that stabilizes preference optimization and improves model alignment without altering the core objective.
Findings
Gate-DPO reduces squeezing effects during training.
Smaller gated models outperform larger ungated models in response quality.
Mass-dynamics analysis shows healthier optimization with Gate-DPO.
Abstract
Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise preferences, removing the need for reward modeling and policy optimization. However, recent work shows that DPO exhibits a squeezing effect, where negative gradients applied to rejected responses concentrate probability mass on high-confidence predictions while suppressing alternative responses. This phenomenon arises even in simple softmax models and can lead to systematic probability collapse during training. We introduce Gradient-Gated Preference Optimization (Gate-DPO), a method that stabilizes training by modulating rejected gradients according to the model's probability geometry. When updates target extremely low-probability responses, the gate…
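The squeezing effect described in the abstract can be reproduced in a toy softmax model. The sketch below is an illustration only, not the paper's experiment: the four logits, the step size, and the helper names `softmax` and `push_down` are assumptions chosen for the example. It applies one gradient step that lowers the log-probability of an already unlikely "rejected" class and compares the distribution before and after. It does not implement the Gate-DPO gate, whose form is not given in this excerpt.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def push_down(logits, rejected, lr=1.0):
    """One gradient-descent step on log p(rejected) with respect to the logits.

    Uses d log p_y / d z_k = 1[k == y] - p_k, so the rejected logit drops by
    lr * (1 - p_y) and every other logit rises by lr * p_k.
    """
    probs = softmax(logits)
    return [
        z - lr * ((1.0 if k == rejected else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]

# Toy setup (assumed values): class 0 is already the confident prediction,
# class 3 is a very low-probability "rejected" response.
logits = [2.0, 1.0, 0.0, -3.0]
before = softmax(logits)
after = softmax(push_down(logits, rejected=3, lr=5.0))

for k, (b, a) in enumerate(zip(before, after)):
    print(f"class {k}: before={b:.4f} after={a:.4f}")
```

Because each non-rejected logit rises in proportion to its current probability, the most confident class gains the most, so the mass removed from the rejected class flows mainly to the top prediction rather than spreading across the alternatives.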
