Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee

TL;DR
This paper introduces Margin Matching Preference Optimization (MMPO), a novel method that incorporates relative quality margins into LLM fine-tuning, resulting in improved performance and robustness over traditional binary preference methods.
Contribution
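The paper proposes MMPO, a preference optimization approach that converts the quality margins in pairwise preferences into soft target probabilities via the Bradley-Terry model, so that LLM policies and reward models are trained on granular rather than binary feedback.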
Findings
MMPO outperforms baseline methods on MT-bench and RewardBench.
The 7B model trained with MMPO achieves state-of-the-art results on RewardBench.
MMPO produces more robust and better-calibrated models.
Abstract
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple binary labels, such as those indicating preferred outputs in pairwise preferences, which fail to capture the subtle differences in relative quality between pairs. To address this limitation, we introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Specifically, given quality margins in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the standard cross-entropy objective. Experiments with both human and AI…
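The soft-target idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the quality margin is treated as a reward difference in the Bradley-Terry model, so that the target probability is a sigmoid of the scaled margin. The function names and the `scale` parameter are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def soft_target(margin: float, scale: float = 1.0) -> float:
    """Bradley-Terry probability that response A beats response B.

    Assumption: `margin` is the quality gap (score_A - score_B) and
    `scale` is a hypothetical hyperparameter controlling how sharply
    the gap maps to a probability.
    """
    return 1.0 / (1.0 + math.exp(-scale * margin))


def soft_cross_entropy(logit_diff: float, target: float) -> float:
    """Cross-entropy between a soft target and the model's prediction.

    `logit_diff` is the model's reward difference r(A) - r(B), so the
    predicted preference probability is sigmoid(logit_diff).
    """
    return -(target * log_sigmoid(logit_diff)
             + (1.0 - target) * log_sigmoid(-logit_diff))


if __name__ == "__main__":
    # A small margin yields a target near 0.5; a large one approaches 1.
    for margin in (0.0, 1.0, 4.0):
        t = soft_target(margin)
        print(f"margin={margin:.1f} target={t:.3f} "
              f"loss_at_zero_logit={soft_cross_entropy(0.0, t):.3f}")
```

With a target of exactly 1.0, the loss reduces to the usual binary preference loss, `-log(sigmoid(logit_diff))`. With a soft target, the loss is minimized when the model's predicted probability matches the target, so pairs with small quality gaps do not push the model toward overconfident predictions.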
Taxonomy
Topics: Data Management and Algorithms
