Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang

TL;DR
Uni-DPO introduces a dynamic preference optimization framework that adaptively reweights preference pairs based on data quality and model performance, leading to improved efficiency and superior results in training large language models.
Contribution
The paper proposes a unified preference optimization framework that accounts for both the quality of each preference pair and the model's evolving performance during training, improving data utilization and final results.
Findings
Outperforms baseline methods across multiple benchmarks.
Achieves 6.7 points higher on Arena-Hard with Gemma-2-9B-IT.
Demonstrates robustness and generalization across tasks.
Abstract
Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks,…
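The reweighting idea in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's formulation: the abstract does not give the weight definitions, so the sigmoid-of-score-gap quality weight, the focal-style performance weight, the function name `reweighted_dpo_loss`, and the `gamma` parameter are all assumptions made for this example. The calibrated NLL term mentioned in the reviews is omitted.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def reweighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                        score_w, score_l, beta=0.1, gamma=2.0):
    """Per-pair DPO loss scaled by two illustrative weights.

    logp_* / ref_logp_*: sequence log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the reference model.
    score_*: external scalar quality scores for each response.
    Both weight formulas below are ASSUMPTIONS, not taken from the paper.
    """
    logp_w, logp_l, ref_logp_w, ref_logp_l, score_w, score_l = (
        np.asarray(a, dtype=float)
        for a in (logp_w, logp_l, ref_logp_w, ref_logp_l, score_w, score_l)
    )

    # Standard DPO implicit reward margin between chosen and rejected.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

    # Standard DPO per-pair loss: -log(sigmoid(margin)), computed stably.
    per_pair = np.logaddexp(0.0, -margin)

    # Assumed quality weight: larger score gap -> larger weight.
    w_qual = sigmoid(score_w - score_l)

    # Assumed focal-style performance weight: pairs the model already
    # separates well (large margin) are down-weighted.
    w_perf = (1.0 - sigmoid(margin)) ** gamma

    weights = w_qual * w_perf
    return float(np.mean(weights * per_pair)), weights


# Two pairs with identical log-probabilities but different score gaps.
loss, weights = reweighted_dpo_loss(
    logp_w=[-1.0, -1.0], logp_l=[-1.0, -1.0],
    ref_logp_w=[-1.0, -1.0], ref_logp_l=[-1.0, -1.0],
    score_w=[9.0, 6.0], score_l=[5.0, 5.0],
)
print(loss, weights)
```

In a gradient-based training loop, weights of this kind would typically be treated as constants (stop-gradient) so that they rescale each pair's gradient rather than contribute their own; whether Uni-DPO does this is not stated in the text above.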
Peer Reviews
Decision: ICLR 2026 Poster
- Clear motivation and formulation: The paper explicitly identifies the limitations of DPO's uniform treatment of preference pairs and proposes dual weighting plus a c-NLL objective.
- Empirical gains across domains: Reported improvements over SFT, DPO, and SimPO on text, math, and multimodal data demonstrate the strength of Uni-DPO.
- Ablations: The paper shows that removing any single component degrades performance, empirically supporting the necessity of each one.
- Lack of transparency and reproducibility of the scalar quality score $w_{\text{qual}}$.
- The core of Uni-DPO's contribution, $w_{\text{qual}}$, depends on external scalar scores produced by "expert evaluators" (e.g., GPT-4). However, the exact procedure is not disclosed. Since $w_{\text{qual}}$ is an important component of Uni-DPO, the procedure of obtaining this score should be thoroughly described in the paper.
- Insufficient engagement with existing sample-wise weighting preference opt
1. The paper clearly identifies that standard DPO treats all preference pairs uniformly, which underutilizes high-quality feedback and fails to adapt to varying task difficulty. Uni-DPO tackles this with adaptive sample weighting, allowing the model to focus on informative training examples and thus improving learning efficiency. This insight targets an important problem in RLHF.
2. The quality-aware weight prioritizes pairs with larger expert score margins, while the performance-aware weight, ins
1. A notable concern is that Uni-DPO requires a “quality score” for each preference pair, often obtained via human annotation or a powerful proxy model like GPT-4. This introduces an extra dependency on external evaluators (in effect, a form of reward signal), partially undermining the simplicity of the reward model-free DPO paradigm. If these quality scores are noisy, biased, or unavailable, it’s unclear how well the method would perform. The authors themselves acknowledge that training data qu
- This paper derives the gradient coefficient of Uni-DPO in closed form, explicitly integrating two modulation factors (the quality weight and the performance weight) to provide a principled explanation for online sample re-weighting.
- Uni-DPO preserves the simplicity of single-stage offline training without introducing additional reward models or iterative sampling overhead. With only two learnable weights and a calibrated loss, it can achieve consistent performance improvements across diver
- Although the paper proposes a "unified dynamic weighting paradigm", its core idea essentially combines a quality-aware weight with a performance-aware weight. Such sample reweighting concepts have already been extensively studied in machine learning and RLHF literature, including focal loss, curriculum learning, and advantage reweighting. Therefore, the conceptual novelty of Uni-DPO appears somewhat incremental rather than fundamentally new.
- A major concern lies in the lack of essential bas
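Several reviews refer to a closed-form gradient coefficient for the reweighted objective. The paper's own derivation is not reproduced on this page, so the following is only a generic worked equation under stated assumptions: a per-pair weight $w = w_{\text{qual}} \cdot w_{\text{perf}}$ that is treated as a constant with respect to $\theta$ (stop-gradient), where $w_{\text{perf}}$ is notation introduced here for the performance weight and is not taken from the source. With the standard DPO margin

$$
\Delta_\theta = \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right],
\qquad
\ell(\theta) = -\, w \, \log \sigma(\Delta_\theta),
$$

the gradient of the weighted per-pair loss is

$$
\nabla_\theta \ell(\theta)
= -\, w \, \sigma(-\Delta_\theta) \, \nabla_\theta \Delta_\theta
= -\, \beta \, w \, \sigma(-\Delta_\theta) \left[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \right].
$$

Under these assumptions the weight simply rescales the usual DPO coefficient $\beta\,\sigma(-\Delta_\theta)$. If the weight were allowed to depend on $\theta$ without a stop-gradient, an additional term involving $\nabla_\theta w$ would appear.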
Code & Models
- 🤗 psp-dada/Llama-3-8B-Base-SFT-Uni-DPO-v2-GPT-4 · model · 3 dl · ♡ 1
- 🤗 psp-dada/Gemma2-9B-IT-Uni-DPO · model · 4 dl · ♡ 1
- 🤗 psp-dada/Llama-3-8B-Base-SFT-Uni-DPO-v2-Qwen · model · 3 dl · ♡ 1
- 🤗 psp-dada/Llama-3-8B-Base-SFT-Uni-DPO · model · 5 dl · ♡ 1
- 🤗 psp-dada/Llama-3-8B-Instruct-Uni-DPO-v2-ArmoRM · model · 3 dl · ♡ 1
- 🤗 psp-dada/Llama-3-8B-Instruct-Uni-DPO-v2-GPT-4o · model · 3 dl · ♡ 1
- 🤗 psp-dada/Qwen2.5-7B-Uni-DPO · model · 4 dl · ♡ 1
- 🤗 psp-dada/Llama-3-8B-Instruct-Uni-DPO · model · 3 dl · ♡ 1
- 🤗 psp-dada/Qwen2.5-Math-7B-Uni-DPO · model · 3 dl · ♡ 1
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms · Recommender Systems and Techniques
