TL;DR
This paper introduces a unified framework for preference optimization in language models, identifying conditions to suppress rejected responses while preserving preferred ones, and proposes a practical calibration method to improve training dynamics.
Contribution
The paper shows that different preference objectives share a common incentive-score decomposition, introduces the disentanglement band (DB) condition, and proposes a reward calibration method to improve preference optimization.
Findings
Reward calibration yields more disentangled training dynamics, suppressing the rejected response while preserving the chosen one.
The method achieves better downstream performance.
The framework unifies analysis of various preference objectives.
Abstract
Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin-based methods also suppress the chosen response when they try to suppress the rejected one, and there is no general way to prevent this across different objectives. We address this issue with a unified incentive-score decomposition of preference optimization, revealing that different objectives share the same local update directions and differ only in their scalar weights. This decomposition provides a common framework for analyzing objectives that were previously studied in separate settings. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the disentanglement band (DB), a simple, testable condition that tells us when training can follow the desired path: suppress the loser while preserving the winner,…
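The claim that objectives share local update directions and differ only in scalar weights can be illustrated with three well-known margin-based losses. The sketch below is not the paper's incentive-score decomposition; it is a minimal illustration under the assumption that each loss is a function of a single margin m (the chosen-minus-rejected log-ratio), so that by the chain rule every gradient has the form w(m) * (grad log pi(y_w) - grad log pi(y_l)). The function names and the default values of `beta`, `tau`, and `delta` are hypothetical choices for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dpo_weight(margin: float, beta: float = 0.1) -> float:
    # DPO loss: -log(sigmoid(beta * m)).
    # Derivative w.r.t. m: -beta * sigmoid(-beta * m).
    return -beta * sigmoid(-beta * margin)


def ipo_weight(margin: float, tau: float = 0.1) -> float:
    # IPO loss: (m - 1 / (2 * tau)) ** 2.
    # Derivative w.r.t. m: 2 * (m - 1 / (2 * tau)).
    return 2.0 * (margin - 1.0 / (2.0 * tau))


def hinge_weight(margin: float, delta: float = 1.0) -> float:
    # Hinge-style loss: max(0, delta - m).
    # Derivative w.r.t. m: -1 while the margin is below delta, else 0.
    return -1.0 if margin < delta else 0.0


if __name__ == "__main__":
    # Same update direction for all three; only the scalar weight differs.
    for m in (-2.0, 0.0, 2.0, 6.0):
        print(m, dpo_weight(m), ipo_weight(m), hinge_weight(m))
```

A negative weight means a gradient-descent step raises the chosen log-likelihood relative to the rejected one. Whether the chosen likelihood itself rises or falls depends on how the two gradient terms interact, which is the question the disentanglement band condition is meant to address.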
