Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization

Yunjae Won; Hyunji Lee; Hyeonbin Hwang; Minjoon Seo

arXiv:2505.23761·cs.LG·October 3, 2025

Open Access 3 Reviews

TL;DR

This paper offers a Bayesian framework for understanding Direct Preference Optimization (DPO), revealing how its reward structure, training dynamics, and downstream performance are driven by the concept of differential information, thus providing a theoretical foundation and practical insights.

Contribution

It introduces the Differential Information Distribution (DID) to formalize preference optimization from a Bayesian perspective, clarifying the rationale behind DPO's reward and its effects on training and performance.

Findings

01

DPO's log-ratio reward is justified by preferences encoding Differential Information.

02

Training dynamics follow a power-law relationship with DID.

03

A high-entropy DID enhances open-ended instruction-following, whereas a low-entropy DID improves knowledge-intensive QA.
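The entropy contrast in finding 03 can be made concrete with a toy example. The sketch below is illustrative only: it assumes a finite response set and treats the DID as the normalized ratio of the target policy to the reference policy, a form suggested by the "Likelihood Ratio Representation" that Reviewer 03 mentions but not spelled out on this page. The function names `normalized_ratio` and `shannon_entropy` and all probabilities are hypothetical.

```python
import math


def normalized_ratio(pi, pi_ref):
    """Normalize pi(y) / pi_ref(y) over a finite support into a distribution.

    Assumption: this stands in for the DID from pi_ref to pi.
    """
    ratios = [p / q for p, q in zip(pi, pi_ref)]
    total = sum(ratios)
    return [r / total for r in ratios]


def shannon_entropy(dist):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical reference policy over four candidate responses.
pi_ref = [0.40, 0.30, 0.20, 0.10]

# Target policy that nudges many responses slightly (broad update).
pi_broad = [0.35, 0.30, 0.22, 0.13]

# Target policy that concentrates mass on one response (sharp update).
pi_sharp = [0.10, 0.10, 0.10, 0.70]

did_broad = normalized_ratio(pi_broad, pi_ref)
did_sharp = normalized_ratio(pi_sharp, pi_ref)

print("broad update entropy:", shannon_entropy(did_broad))
print("sharp update entropy:", shannon_entropy(did_sharp))
```

Under these assumptions, the broad update yields a nearly uniform ratio distribution (higher entropy), while the sharp update concentrates the ratio on a single response (lower entropy).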

Abstract

Direct Preference Optimization (DPO) has been widely used for aligning language models with human preferences in a supervised manner. However, several key questions remain unresolved: the rationale behind its log-ratio reward, how the statistical structure of preference datasets shapes its training dynamics, and how those dynamics impact downstream capabilities. We approach these questions from a Bayesian perspective, interpreting the goal of preference optimization as learning the differential information required to update a reference policy into a target policy. To formalize this view, we introduce the Differential Information Distribution (DID), defined as the distribution over samples that carry the Bayesian evidence required to update policies. We introduce three complementary insights by viewing preference optimization through the DID. First, we find that DPO's log-ratio reward…
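For reference, the log-ratio reward mentioned in the abstract is the standard implicit reward of DPO, and the loss is a logistic loss on the reward margin between the preferred response $y_w$ and the dispreferred response $y_l$. The last line is only an illustrative reading of "updating a reference policy into a target policy" as a Bayesian update; the event symbol $X$ and the target policy $\pi^{*}$ are notation assumed here, not quoted from the paper.

```latex
% Standard DPO implicit reward and loss
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}

\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]

% Illustrative Bayesian reading (assumed notation): the reference policy acts
% as a prior and the evidence X reweights it toward the target policy
\pi^{*}(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)\, P(X \mid y)
```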

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Framework - The paper provides a thorough analysis under the DID framework and shows conditions under which the DPO reward is optimal. The framework takes a Bayesian perspective and brings a new approach to analyzing the behavior of DPO. The analysis leads to potential new insights into the effects of likelihood displacement and the structure of preference data.

Weaknesses

Justification - The framework, while interesting and novel, lacks justification for key assumptions and definitions. For example, in Definition 2.2, it is unclear why conditional independence and a Bayesian update should model preference data and learning well. Because the framework relies on this definition, justification and empirical support are needed. Furthermore, in Theorem 4.1, it is assumed that either preferred or dispreferred responses are sampled from the reference mod

Reviewer 02 · Rating 6 · Confidence 4

Strengths

+ The paper introduces the Differential Information Distribution (DID), providing a deep understanding of how Bayesian evidence drives the updating of policies in DPO.
+ The paper demonstrates that the reward parameterization, training dynamics, and learned capabilities in DPO emerge naturally by analyzing DID.
+ By analyzing the Shannon entropy of the DID, this paper demonstrates how DID entropy influences the trade-off between factual accuracy and open-ended task performance via a real LLM e

Weaknesses

+ The controlled Energy-Based Model experiments and empirical tests (Figure 1, Figure 2, Figure 3) in Sections 3 and 4 are built around strong assumptions of matched data-generating processes and synthetic settings where DIDs align almost perfectly. While some results (Table 1) use real-world LLMs and datasets, they are limited. It is uncertain whether the findings from synthetic setups apply to LLMs on real tasks.
+ There is the assumption that $\pi_{w} = \pi_{\text{ref}}$ in Section 4, along wi

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Establishes a Bayesian formulation of preference optimization, linking DPO to information-theoretic evidence accumulation.
2. Explains DPO’s reward structure, dynamics, and task-dependent behaviors (open-ended vs factual) under one consistent lens.
3. Provides closed-form derivations (e.g., Likelihood Ratio Representation, Entropy of DID) that connect policy updates to Bayesian ratios.
4. Offers a principled way to reason about why different `β` or entropy configurations produce distinc

Weaknesses

1. The conditional independence of `X` from the prior and the power-law DID assumption are not empirically testable or demonstrated to hold in real preference data.
2. DID “existence” is defined by construction, not derived, which weakens claims of theoretical generality.
3. DID entropy estimation uses a small sample (`K=32`) with potentially large variance; no confidence intervals or significance testing are reported.
4. Theorem 3.2’s “unique justification” of the log-ratio reward is cont

Taxonomy

Topics: Consumer Market Behavior and Pricing

Methods: Direct Preference Optimization