BPO: Revisiting Preference Modeling in Direct Preference Optimization

Lin Sun; Chuang Liu; Peng Liu; Bingyang Li; Weijia Lu; Ning Wu

arXiv:2506.03557·cs.CL·June 5, 2025

TL;DR

This paper introduces Balanced Preference Optimization (BPO), a framework that improves preference modeling in Large Language Models by addressing a limitation of Direct Preference Optimization (DPO): the likelihood of chosen responses can fall during training. The authors report better performance from a simple change to the loss.

Contribution

BPO offers a novel method to balance chosen and rejected responses in preference optimization, resolving DPO's Degraded Chosen Responses issue without extra constraints.

Findings

01. BPO improves accuracy by over 10% on mathematical reasoning tasks.

02. BPO outperforms DPO and its variants across multiple models.

03. Implementation requires only a single line of code change.

Abstract

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO's DCR issue, without introducing additional constraints to the loss function.…
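The contrast between DPO's pairwise margin and a balanced margin can be sketched in a few lines. The sketch below is an illustration only, not the authors' implementation: it assumes the balanced reward margin is the minimum of the chosen response's implicit reward and the negated rejected reward, as Reviewer 03 summarizes it further down this page, and it omits the gap adaptor because this page does not define it. The function names, the `beta` default, and the example reward values are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_rewards(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style implicit rewards: beta times the policy/reference log-ratio.

    logp_w / logp_l are policy log-likelihoods of the chosen / rejected
    responses; ref_* are the same quantities under the frozen reference model.
    """
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    return r_w, r_l


def dpo_loss(r_w: float, r_l: float) -> float:
    """Standard DPO loss: depends only on the relative margin r_w - r_l."""
    return -log_sigmoid(r_w - r_l)


def balanced_margin_loss(r_w: float, r_l: float) -> float:
    """Assumed balanced margin: min(r_w, -r_l) in place of r_w - r_l.

    This follows a reviewer's summary, not the paper's exact objective.
    """
    return -log_sigmoid(min(r_w, -r_l))


# Hypothetical case where both likelihoods have dropped below the reference:
# the chosen reward is negative, yet the pairwise margin is still positive.
r_w, r_l = -1.0, -3.0
print("DPO loss:     ", dpo_loss(r_w, r_l))
print("balanced loss:", balanced_margin_loss(r_w, r_l))
```

In this hypothetical case the DPO loss is small because the margin `r_w - r_l` is positive, even though the chosen response has become less likely than under the reference. Under the assumed balanced margin, the negative chosen reward becomes the active term and the loss stays high, which is the behavior the abstract attributes to BPO.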

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper provides a theoretical justification for BPO, including a gradient analysis and Theorem 1, which ensures the learned policy maintains a minimum likelihood for chosen responses and prevents the DCR problem.
- The paper is well-structured and clearly written. This clear presentation makes the paper's contributions easy to follow.

Weaknesses

- The experimental evaluation is limited to mathematical reasoning tasks. Consequently, it remains unclear whether BPO can generalize to other prevalent alignment objectives, such as instruction-following, helpfulness, or harmlessness.
- While the paper claims "Accelerated Convergence" and "Reduced Computational Overhead" as key advantages of BPO over DPO, these claims are not supported by corresponding empirical evidence.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is very well written.
2. The methodology is very simple and easy to integrate with DPO-type algorithms, which should ease adoption.
3. The authors provide a good initial set of experimental evaluations (see the weaknesses for some major comments on them).
4. I like that the authors provide ablations with different loss types and tried two families of models.

Weaknesses

1. The field has moved well beyond DPO. The paper therefore lacks comparisons with key strong baselines such as SimPO, KTO, ODPO, BDPO, or ORPO, some of which tackle the diminishing log-probability of chosen samples as a problem. Vanilla DPO alone is not a reasonable baseline; obviously this is going to perform better than DPO. The authors must compare performance with the current SOTA methods that improve on the main DPO algorithm and not restrict themselves to just the algorith

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Novel Objective - The paper introduces a new objective to address the decrease in likelihood for preferred responses: the balanced reward margin, which takes the minimum of the reward of the preferred response and the negative reward of the unpreferred response. This creates an optimization landscape that avoids the likelihoods of both responses decreasing, and the authors demonstrate improved performance on math reasoning benchmarks by reducing likelihood displacement. Empirica

Weaknesses

Experimental Setup - The experiments focus on applying DPO and its variants to math reasoning, but math reasoning training is often done with RL methods such as PPO or GRPO, and works that do use DPO involve editing responses or updating preference data, as in the Llama 3 paper cited. DPO is also most widely applied to human preference data such as HH-RLHF or UltraFeedback, so it is unclear whether BPO would lead to improvements for more common applications of DPO. An evaluation of the methods on s

Taxonomy

Topics: Machine Learning and Data Classification · Constraint Satisfaction and Optimization · Recommender Systems and Techniques