Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

Taneesh Gupta; Rahul Madhavan; Xuchao Zhang; Nagarajan Natarajan; Chetan Bansal; Saravan Rajmohan

arXiv:2412.04628·cs.LG·June 23, 2025

Open Access 3 Reviews

TL;DR

This paper introduces Multi-Preference Optimization (MPO), a novel extension of DPO that optimizes over response sets, improving language model alignment by leveraging set-level comparisons and outlier emphasis.

Contribution

MPO generalizes DPO to set-level preferences, employs deviation-based weighting for better learning, and provides theoretical bias reduction guarantees.

Findings

01

Achieves state-of-the-art on UltraFeedback benchmark.

02

Up to 17.5% improvement in length-controlled win rate on AlpacaEval2.

03

Theoretically reduces alignment bias at a rate of 1/√n.

Abstract

Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses per prompt, which are scored by a reward model to guide learning. In this setting, we propose Multi-Preference Optimization (MPO), a generalization of DPO that optimizes over entire sets of responses by extending the Bradley-Terry model to groupwise comparisons between chosen and rejected sets. To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward, effectively inducing a self-paced curriculum. We theoretically prove that MPO reduces alignment bias at a rate of $O\left(\frac{1}{\sqrt{n}}\right)$ with respect to the number of responses per query.…
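
To make the set-level contrast concrete, here is a minimal Python sketch of one way a groupwise Bradley-Terry loss with deviation-based weighting could look. It is an illustration under stated assumptions, not the paper's exact objective: the abstract does not give the loss, so the function name `set_level_loss`, the split of responses into chosen and rejected sets at the mean reward, the use of the absolute deviation from the mean as a multiplicative weight, and the `beta` scale are all hypothetical choices.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def set_level_loss(implicit_rewards, rm_scores, beta=0.1):
    """Weighted groupwise contrast between chosen and rejected responses.

    implicit_rewards: per-response log pi_theta(y|x) - log pi_ref(y|x),
        the DPO-style implicit reward (assumed precomputed).
    rm_scores: per-response reward-model scores.

    ASSUMPTIONS (not taken from the paper): responses scoring above the
    mean form the chosen set, the rest form the rejected set, and each
    response is weighted by |score - mean| so outliers count more.
    """
    if len(implicit_rewards) != len(rm_scores):
        raise ValueError("inputs must have the same length")
    mean = sum(rm_scores) / len(rm_scores)

    chosen_logits, all_logits = [], []
    for r, s in zip(implicit_rewards, rm_scores):
        deviation = abs(s - mean)
        if deviation == 0.0:
            continue  # zero weight: the response contributes nothing
        # log(weight * exp(beta * r)) = beta * r + log(weight)
        logit = beta * r + math.log(deviation)
        all_logits.append(logit)
        if s > mean:
            chosen_logits.append(logit)

    if not chosen_logits:
        raise ValueError("all scores are equal; no chosen/rejected split")

    # -log( sum_chosen w*exp(beta*r) / sum_all w*exp(beta*r) )
    return _logsumexp(all_logits) - _logsumexp(chosen_logits)


# Four sampled responses for one prompt (made-up numbers).
loss = set_level_loss(
    implicit_rewards=[0.4, 0.1, -0.2, -0.5],
    rm_scores=[0.9, 0.6, 0.3, 0.1],
    beta=1.0,
)
```

With exactly two responses the two deviations from the mean are equal, so the weights cancel and this sketch reduces to the pairwise DPO logistic loss, `log(1 + exp(-beta * (r_chosen - r_rejected)))`. That is the sense in which a set-level objective of this shape generalizes DPO.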

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- The paper addresses an important challenge in the alignment process, and the presented method is a clean and intuitive generalization of DPO, moving from pairwise to set-wise comparisons.
- The method is shown to achieve state-of-the-art results across a variety of models, benchmarks, and training paradigms (off-policy, on-policy, iterative), demonstrating its robustness and effectiveness, especially on AlpacaEval2.
- The paper provides both theoretical motivation (Theorems 1 & 2) and strong e

Weaknesses

- The on-policy results depend on a single reward model. It is possible that MPO is particularly good at optimizing for the specific reward distribution of the Skywork RM, and its gains might be less pronounced with other RMs.
- The theoretical result on noise robustness (Theorem 2) relies on a specific "spacing-scaled" noise model. It is unclear how realistic this assumption is.
- Relying on an RM leaves the actual alignment open to question; the study would benefit from human evaluation and more qu

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper provides theoretical evidence for why MPO works better than other DPO-style methods.
2. Experiments are conducted, and MPO is compared with several strong baselines.

Weaknesses

1. What is the difference between the "Off-policy Setting" and the "On-policy Setting"? They seem to differ only in the initial model (off-policy uses a weaker SFT model, while on-policy uses a stronger open-source instruct model). If so, why are they given these names? As I understand it, off-policy and on-policy should differ in how they are trained (sampling from the base model or from the current policy model).
2. In the experiments, it seems that some strong baseline models are missing -- Sim

Reviewer 03 · Rating 0 · Confidence 5

Strengths

N/A

Weaknesses

N/A
