Small-Margin Preferences Still Matter-If You Train Them Right
Jinlong Pang, Zhaowei Zhu, Na Di, Yichi Zhang, Yaxuan Wang, Chen Qian, Yang Liu

TL;DR
This paper introduces MixDPO, a difficulty-aware training method for preference-based alignment of large language models. It puts ambiguous (small-margin) pairs to use by combining a preference loss with supervised fine-tuning, which improves alignment performance.
Contribution
The paper proposes MixDPO, a curriculum-based hybrid training strategy that improves preference-based alignment by ordering preference data from easy to hard and routing difficult pairs to a supervised fine-tuning objective.
Findings
MixDPO outperforms DPO and variants on multiple benchmarks.
It achieves significant gains on AlpacaEval 2 length-controlled win rate.
Difficulty-aware training stabilizes preference-based alignment methods.
Abstract
Preference optimization methods such as DPO align large language models (LLMs) using paired comparisons, but their effectiveness can be highly sensitive to the quality and difficulty of preference pairs. A common heuristic treats small-margin (ambiguous) pairs as noisy and filters them out. In this paper, we revisit this assumption and show that pair difficulty interacts strongly with the optimization objective: when trained with preference-based losses, difficult pairs can destabilize training and harm alignment, yet these same pairs still contain useful supervision signals when optimized with supervised fine-tuning (SFT). Motivated by this observation, we propose MixDPO, a simple yet effective difficulty-aware training strategy that (i) orders preference data from easy to hard (a curriculum over margin-defined difficulty), and (ii) routes difficult pairs to an SFT objective while…
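The two ingredients named in the abstract, an easy-to-hard curriculum over margin-defined difficulty and routing of difficult pairs to an SFT objective, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names PreferencePair, easy_to_hard, mixed_loss, and hard_threshold are hypothetical, and the sketch assumes that a small margin means a hard pair, that the SFT term is the negative log-likelihood of the chosen response, and that the remaining pairs receive the standard DPO loss. The abstract is truncated at that point, so the last assumption in particular may differ from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One preference pair with precomputed sequence log-probabilities (hypothetical container)."""
    margin: float             # quality gap between chosen and rejected; small = ambiguous = hard
    logp_chosen: float        # policy log-prob of the chosen response
    logp_rejected: float      # policy log-prob of the rejected response
    ref_logp_chosen: float    # reference-model log-prob of the chosen response
    ref_logp_rejected: float  # reference-model log-prob of the rejected response


def dpo_loss(pair, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    z = beta * ((pair.logp_chosen - pair.ref_logp_chosen)
                - (pair.logp_rejected - pair.ref_logp_rejected))
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def sft_loss(pair):
    """SFT term, assumed here to be the negative log-likelihood of the chosen response."""
    return -pair.logp_chosen


def easy_to_hard(pairs):
    """Curriculum order: largest margin (easiest) first, smallest margin (hardest) last."""
    return sorted(pairs, key=lambda p: p.margin, reverse=True)


def mixed_loss(pair, hard_threshold, beta=0.1):
    """Route hard pairs (margin below the assumed threshold) to SFT, the rest to DPO."""
    if pair.margin < hard_threshold:
        return sft_loss(pair)
    return dpo_loss(pair, beta)


if __name__ == "__main__":
    # Toy log-probabilities, made up purely for illustration.
    pairs = [
        PreferencePair(0.1, -12.0, -12.5, -12.0, -12.5),
        PreferencePair(2.0, -8.0, -15.0, -9.0, -14.0),
        PreferencePair(1.0, -10.0, -11.0, -10.0, -11.0),
    ]
    for pair in easy_to_hard(pairs):
        print(pair.margin, mixed_loss(pair, hard_threshold=0.5))
```

In a real training loop the log-probabilities would come from the policy and reference models, and the threshold (or whatever rule separates easy from hard pairs) would follow the paper's definition rather than the fixed cutoff assumed here.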
Taxonomy
Topics: Topic Modeling · Constraint Satisfaction and Optimization · Machine Learning and Data Classification
