Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Shangpin Peng; Weinong Wang; Zhuotao Tian; Senqiao Yang; Xing Wu; Haotian Xu; Chengquan Zhang; Takashi Isobe; Baotian Hu; Min Zhang

arXiv:2506.10054·cs.LG·February 12, 2026

Open Access · 9 Models · 1 Dataset · 3 Reviews

TL;DR

Uni-DPO introduces a dynamic preference optimization framework that adaptively reweights preference pairs based on data quality and model performance, leading to improved efficiency and superior results in training large language models.

Contribution

It proposes a unified preference optimization framework that weights each preference pair by both its data quality and the model's current performance on it, improving data utilization and final results.

Findings

01

Outperforms baseline methods across multiple benchmarks.

02

Achieves 6.7 points higher on Arena-Hard with Gemma-2-9B-IT.

03

Demonstrates robustness and generalization across tasks.

Abstract

Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks,…
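The reweighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the standard DPO margin is well known, but the functional forms of the two weights below (a sigmoid of the expert score margin for quality, a focal-style factor for performance), the hyperparameters `eta` and `gamma`, and all function names are assumptions chosen for illustration. The calibrated NLL term mentioned in the reviews is omitted, and in real training one would also have to decide whether the weights are detached from the gradient, which this page does not specify.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    """Stable log(1 + exp(x)); note that -log(sigmoid(m)) == softplus(-m)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO implicit reward margin between chosen (w) and rejected (l)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def quality_weight(score_w, score_l, eta=1.0):
    """ASSUMED form: increases with the expert score margin, bounded in (0, 1)."""
    return sigmoid(eta * (score_w - score_l))


def performance_weight(margin, gamma=2.0):
    """ASSUMED focal-style form: shrinks as the model already fits the pair."""
    return (1.0 - sigmoid(margin)) ** gamma


def reweighted_dpo_loss(pair, beta=0.1, eta=1.0, gamma=2.0):
    """Per-pair DPO loss scaled by the two illustrative weights."""
    margin = dpo_margin(pair["logp_w"], pair["logp_l"],
                        pair["ref_logp_w"], pair["ref_logp_l"], beta)
    base = softplus(-margin)  # plain DPO loss: -log sigmoid(margin)
    w_q = quality_weight(pair["score_w"], pair["score_l"], eta)
    w_p = performance_weight(margin, gamma)
    return w_q * w_p * base


# Hypothetical pairs: same expert scores, but the model ranks one correctly
# ("easy") and the other incorrectly ("hard").
easy = dict(logp_w=-10.0, logp_l=-20.0, ref_logp_w=-12.0, ref_logp_l=-12.0,
            score_w=9.0, score_l=3.0)
hard = dict(logp_w=-20.0, logp_l=-10.0, ref_logp_w=-12.0, ref_logp_l=-12.0,
            score_w=9.0, score_l=3.0)

print("easy:", reweighted_dpo_loss(easy))
print("hard:", reweighted_dpo_loss(hard))
```

Under these assumed forms, a pair the model already ranks correctly contributes less to the loss than one it gets wrong, and a pair with a larger expert score margin contributes more than one with a small margin.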

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- Clear motivation and formulation: The paper explicitly identifies DPO's uniform-pair limitations and proposes a dual weighting + c-NLL objective.
- Empirical gains across domains: Reported improvements over SFT/DPO/SimPO on text, math, and multimodal data demonstrate the strength of Uni-DPO.
- Ablations: The paper shows that removing any component degrades performance, empirically supporting the necessity of each component.

Weaknesses

- Lack of transparency and reproducibility of the scalar quality score $w_{\text{qual}}$.
- The core of Uni-DPO's contribution, $w_{\text{qual}}$, depends on external scalar scores produced by "expert evaluators" (e.g., GPT-4). However, the exact procedure is not disclosed. Since $w_{\text{qual}}$ is an important component of Uni-DPO, the procedure of obtaining this score should be thoroughly described in the paper.
- Insufficient engagement with existing sample-wise weighting preference opt

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper clearly identifies that standard DPO treats all preference pairs uniformly, which underutilizes high-quality feedback and fails to adapt to varying task difficulty. Uni-DPO tackles this by adaptive sample weighting, allowing the model to focus on informative training examples and thus improving learning efficiency. This insight targets an important problem in RLHF.
2. The quality-aware weight prioritizes pairs with larger expert score margins, while the performance-aware weight, ins

Weaknesses

1. A notable concern is that Uni-DPO requires a “quality score” for each preference pair, often obtained via human annotation or a powerful proxy model like GPT-4. This introduces an extra dependency on external evaluators (in effect, a form of reward signal), partially undermining the simplicity of the reward model-free DPO paradigm. If these quality scores are noisy, biased, or unavailable, it’s unclear how well the method would perform. The authors themselves acknowledge that training data qu

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- This paper derives the gradient coefficient of Uni-DPO in closed form, explicitly integrating two modulation factors, including quality weight and performance weight, to provide a principled explanation for online sample re-weighting.
- Uni-DPO preserves the simplicity of single-stage offline training without introducing additional reward models or iterative sampling overhead. With only two learnable weights and a calibrated loss, it can achieve consistent performance improvements across diver

Weaknesses

- Although the paper proposes a "unified dynamic weighting paradigm", its core idea essentially combines a quality-aware weight with a performance-aware weight. Such sample reweighting concepts have already been extensively studied in machine learning and RLHF literature, including focal loss, curriculum learning, and advantage reweighting. Therefore, the conceptual novelty of Uni-DPO appears somewhat incremental rather than fundamentally new.
- A major concern lies in the lack of essential bas

Code & Models

Models

Datasets

psp-dada/Uni-DPO
dataset· 153 dl

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms · Recommender Systems and Techniques