BPO: Revisiting Preference Modeling in Direct Preference Optimization
Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

TL;DR
This paper introduces Balanced Preference Optimization (BPO), a new framework that improves preference modeling in Large Language Models by addressing the limitations of Direct Preference Optimization (DPO), leading to better performance and simplicity.
Contribution
BPO offers a novel method to balance chosen and rejected responses in preference optimization, resolving DPO's Degraded Chosen Responses issue without extra constraints.
Findings
BPO improves accuracy by over 10% on mathematical reasoning tasks.
BPO outperforms DPO and variants across multiple models.
Implementation requires only a single line of code change.
Abstract
Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO's DCR issue, without introducing additional constraints to the loss function.…
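To make the DCR issue concrete, the sketch below contrasts the standard DPO pairwise loss with a loss built on a balanced margin. It is an illustration, not the authors' implementation: the margin form, min(reward of chosen, negative reward of rejected), follows the description given in one of the peer reviews on this page, the gap adaptor is omitted because its definition is not given here, and all function names and the value of `beta` are assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss, -log sigmoid(r_chosen - r_rejected).

    Only the difference of the two rewards matters, so the loss cannot
    tell whether the chosen reward itself went up or down.
    """
    return softplus(-(r_chosen - r_rejected))


def balanced_margin_loss(r_chosen: float, r_rejected: float) -> float:
    """Hypothetical balanced-margin loss (assumed form, gap adaptor omitted).

    The margin is min(r_chosen, -r_rejected), as described in the reviews,
    so a falling chosen reward is penalized even when the gap is large.
    """
    margin = min(r_chosen, -r_rejected)
    return softplus(-margin)


# Two reward pairs with the same gap (2.0) but different absolute levels.
healthy = (1.0, -1.0)    # chosen reward rose, rejected reward fell
degraded = (-1.0, -3.0)  # both rewards fell: the DCR pattern

dpo_healthy = dpo_loss(*healthy)
dpo_degraded = dpo_loss(*degraded)
bal_healthy = balanced_margin_loss(*healthy)
bal_degraded = balanced_margin_loss(*degraded)
```

Under these assumptions, the DPO loss is identical for both pairs because it depends only on the gap, while the balanced-margin loss is larger for the degraded pair. In a typical DPO code base the difference amounts to replacing the expression `r_chosen - r_rejected` with the `min(...)` term, which is consistent with the page's claim of a single-line change.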
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper provides a theoretical justification for BPO, including a gradient analysis and Theorem 1, which ensures the learned policy maintains a minimum likelihood for chosen responses and prevents the DCR problem.
- The paper is well-structured and clearly written. This clear presentation makes the paper's contributions easy to follow.
- The experimental evaluation is limited to mathematical reasoning tasks. Consequently, it remains unclear whether BPO can generalize to other prevalent alignment objectives, such as instruction-following, helpfulness, or harmlessness.
- While the paper claims "Accelerated Convergence" and "Reduced Computational Overhead" as key advantages of BPO over DPO, these claims are not supported by corresponding empirical evidence.
Strengths:
1. The paper is very well written.
2. The methodology is very simple and easy to integrate with DPO-type algorithms, which could help adoption.
3. The authors provide a good set of experimental evaluations to begin with (see the weaknesses for some major comments on this).
4. I like that the authors provide ablations with different loss types and tried two families of models.
Weaknesses:
1. The field has moved well beyond DPO. The paper therefore lacks comparisons with key strong baselines such as SimPO, KTO, ODPO, BDPO, or ORPO, some of which tackle the diminishing log probability of chosen samples as a problem. Vanilla DPO alone is not a reasonable baseline; obviously this is going to perform better than DPO. The authors must compare performance with the current SOTA methods that improve on the main DPO algorithm, and not restrict themselves to just the algorith
Novel Objective - The paper introduces a new objective to address the decrease in likelihood for preferred responses through the balanced reward margin, which takes the minimum of the reward of the preferred response and the negative reward of the unpreferred response. This creates an optimization landscape that avoids the likelihoods of both responses decreasing, which addresses the issue, and the authors demonstrate improved performance on math reasoning benchmarks by reducing likelihood displacement. Empirica
Experimental Setup - The experiments focus on applying DPO and its variants to math reasoning, but math reasoning training is often done with RL methods such as PPO or GRPO, and works that do use DPO, such as the cited Llama 3 paper, involve editing responses or updating preference data. DPO is also most widely applied to human preference data such as HH-RLHF or UltraFeedback, so it is unclear whether BPO would lead to improvements for more common applications of DPO. An evaluation of the methods on s
Taxonomy
Topics: Machine Learning and Data Classification · Constraint Satisfaction and Optimization · Recommender Systems and Techniques
