Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift
Chengtao Jian, Kai Yang, Tianhao Gao, Wuguang Ni, Keying Yang, Bowen Xiao, Jiajun Liu, Ye Ouyang

TL;DR
This paper identifies a critical failure mode in preference learning called Catastrophic Preference Shift, analyzes its causes, and proposes a Stable Preference Optimization framework to mitigate it, improving model alignment stability.
Contribution
The paper introduces a theoretically grounded Stable Preference Optimization (SPO) framework that constrains preference learning, addressing the limitations of existing Bradley-Terry (BT) style methods and improving alignment reliability.
Findings
SPO stabilizes preference learning and improves performance.
Existing BT-style methods suffer from preference shift and performance degradation.
SPO outperforms baseline methods in empirical evaluations.
Abstract
Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns language models with human preferences. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, where the absolute probability of preferred responses decreases alongside that of dispreferred responses during training. We demonstrate that such displacement can lead to a more devastating failure mode, which we define as Catastrophic Preference Shift, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. Such a failure mode is a key limitation shared across BT-style direct preference learning methods, due to the fundamental conflict between unconstrained discriminative alignment and generative foundational…
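The likelihood displacement described in the abstract can be illustrated with a minimal sketch of a generic DPO-style Bradley-Terry loss. This is not the paper's SPO method; the function name `bt_preference_loss`, the `beta` value, and all log-probabilities are hypothetical, chosen only for illustration. The point is that the loss depends only on the margin between the preferred and dispreferred responses, so it can fall while the preferred response's own probability falls too.

```python
import math


def bt_preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic DPO-style Bradley-Terry loss: -log(sigmoid(beta * margin))."""
    # Margin of implicit rewards (policy log-prob minus reference log-prob).
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)) = -log(sigmoid(-z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical sequence log-probabilities (illustrative numbers only).
ref_chosen, ref_rejected = -10.0, -10.0

# Step 0: the policy equals the reference, so the margin is zero.
loss_before = bt_preference_loss(-10.0, -10.0, ref_chosen, ref_rejected)

# Later: both log-probabilities have dropped, the rejected one by more.
loss_after = bt_preference_loss(-12.0, -20.0, ref_chosen, ref_rejected)

print(f"loss before: {loss_before:.4f}")
print(f"loss after:  {loss_after:.4f}")
print(f"chosen probability before: {math.exp(-10.0):.3e}")
print(f"chosen probability after:  {math.exp(-12.0):.3e}")
```

In this toy setting the loss decreases even though the preferred response became less likely. Nothing in the objective says where the freed probability mass goes, which is the gap the abstract's OOD concern points at.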
Taxonomy
Topics: Constraint Satisfaction and Optimization · Advanced Multi-Objective Optimization Algorithms · Multi-Criteria Decision Making
