Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts
Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan

TL;DR
This paper introduces Multi-Preference Optimization (MPO), a novel extension of DPO that optimizes over response sets, improving language model alignment by leveraging set-level comparisons and outlier emphasis.
Contribution
MPO generalizes DPO to set-level preferences, employs deviation-based weighting for better learning, and provides theoretical bias reduction guarantees.
Findings
Achieves state-of-the-art results on the UltraFeedback benchmark.
Up to 17.5% improvement in length-controlled win rate on AlpacaEval2.
Theoretically reduces alignment bias at a rate of 1/√n, where n is the number of responses per query.
Abstract
Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses per prompt, which are scored by a reward model to guide learning. In this setting, we propose Multi-Preference Optimization (MPO), a generalization of DPO that optimizes over entire sets of responses by extending the Bradley-Terry model to groupwise comparisons between chosen and rejected sets. To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward, effectively inducing a self-paced curriculum. We theoretically prove that MPO reduces alignment bias at a rate of 1/√n with respect to the number of responses per query.…
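To make the set-level contrast concrete, the following is a minimal sketch of one plausible form of such a loss, not the authors' implementation. It assumes the chosen set is every response whose reward is above the mean, that the groupwise Bradley-Terry term is the log-ratio of summed exponentiated scores (chosen set over all responses), and that deviation-based weighting enters as an additive bonus `alpha * |r - mean|` in log-space. The names `split_by_mean`, `set_level_loss`, and `alpha` are hypothetical, as is the mean-based split; the paper may define these differently.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def split_by_mean(rewards):
    """Assumed split: responses above the mean reward are 'chosen'."""
    mean = sum(rewards) / len(rewards)
    chosen = [i for i, r in enumerate(rewards) if r > mean]
    rejected = [i for i, r in enumerate(rewards) if r <= mean]
    return mean, chosen, rejected


def set_level_loss(scores, rewards, alpha=1.0):
    """Set-level contrastive loss for one prompt (illustrative form only).

    scores:  implicit reward per response, e.g. a DPO-style
             beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    rewards: reward-model score per response.
    alpha:   hypothetical strength of the deviation-based weighting.
    """
    mean, chosen, _ = split_by_mean(rewards)
    if not chosen:
        raise ValueError("all rewards are equal; no chosen set can be formed")
    # Assumed weighting: responses far from the mean reward get a larger
    # additive bonus in log-space, so outliers dominate both sums.
    shifted = [s + alpha * abs(r - mean) for s, r in zip(scores, rewards)]
    log_chosen = logsumexp([shifted[i] for i in chosen])
    log_all = logsumexp(shifted)
    # -log( sum_chosen exp(.) / sum_all exp(.) )
    return log_all - log_chosen


if __name__ == "__main__":
    scores = [0.8, 0.1, -0.3, -0.9]   # made-up implicit rewards
    rewards = [0.9, 0.6, 0.3, 0.1]    # made-up reward-model scores
    print(set_level_loss(scores, rewards, alpha=1.0))
```

With exactly one chosen and one rejected response and `alpha=0`, this expression reduces to `-log sigmoid(s_chosen - s_rejected)`, the familiar pairwise DPO form, which is the sense in which a set-level objective generalizes DPO.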
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper addresses an important challenge in the alignment process, and the presented method is a clean and intuitive generalization of DPO, moving from pairwise to set-wise comparisons.
- The method is shown to achieve state-of-the-art results across a variety of models, benchmarks, and training paradigms (off-policy, on-policy, iterative), demonstrating its robustness and effectiveness, especially on AlpacaEval2.
- The paper provides both theoretical motivation (Theorems 1 & 2) and strong e
- The on-policy results depend on a single reward model. It is possible that MPO is particularly good at optimizing for the specific reward distribution of the Skywork RM, and its gains might be less pronounced with other RMs.
- The theoretical result on noise robustness (Theorem 2) relies on a specific "spacing-scaled" noise model. It is unclear how realistic the assumption is.
- Relying on an RM leaves the actual alignment open to question; the study would benefit from human evaluation and more qu
1. This paper provides theoretical evidence on why MPO works better than other DPO-style methods.
2. Experiments are conducted, and MPO is compared with several strong baselines.
1. What is the difference between the "Off-policy Setting" and the "On-policy Setting"? It seems that they differ only in the initial model (off-policy uses a weaker SFT model, while on-policy uses a stronger open-source instruct model). If so, why are they given these names? Based on my understanding, off-policy and on-policy should differ in how the models are trained (sampling from the base model or from the current policy model).
2. In the experiments, it seems some strong baseline models are missing -- Sim
Taxonomy
Topics: Consumer Market Behavior and Pricing
