Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization
Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo

TL;DR
This paper offers a Bayesian framework for understanding Direct Preference Optimization (DPO), revealing how its reward structure, training dynamics, and downstream performance are driven by the concept of differential information, thus providing a theoretical foundation and practical insights.
Contribution
It introduces the Differential Information Distribution (DID) to formalize preference optimization from a Bayesian perspective, clarifying the rationale behind DPO's reward and its effects on training and performance.
Findings
DPO's log-ratio reward is justified by preferences encoding Differential Information.
Training dynamics follow a power-law relationship with DID.
A high-entropy DID enhances open-ended instruction following, while a low-entropy DID improves knowledge-intensive QA.
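To make the entropy finding concrete, here is a minimal sketch on a toy discrete response space. It assumes the likelihood-ratio form that the reviews below refer to, namely that the DID is proportional to the policy-to-reference ratio, DID(y) ∝ π(y) / π_ref(y). That form, the function names, and the toy probabilities are illustrative assumptions, not the authors' code or estimator.

```python
import math

def did_from_policies(pi, pi_ref):
    """Toy DID over a finite response set.

    ASSUMPTION: DID(y) is proportional to pi(y) / pi_ref(y), normalized to sum to 1.
    Requires pi_ref(y) > 0 for every y.
    """
    ratios = [p / q for p, q in zip(pi, pi_ref)]
    z = sum(ratios)
    return [r / z for r in ratios]

def shannon_entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical example: a uniform reference over four candidate responses.
pi_ref = [0.25, 0.25, 0.25, 0.25]
sharp_policy = [0.7, 0.1, 0.1, 0.1]   # update concentrates mass on one response
broad_policy = [0.3, 0.3, 0.2, 0.2]   # update spreads mass over several responses

sharp_did = did_from_policies(sharp_policy, pi_ref)
broad_did = did_from_policies(broad_policy, pi_ref)
print(shannon_entropy(sharp_did), shannon_entropy(broad_did))
```

With a uniform reference the toy DID coincides with the target policy, so the sharper update yields the lower-entropy DID. For real language models the response space cannot be enumerated, so the entropy has to be estimated from samples instead.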
Abstract
Direct Preference Optimization (DPO) has been widely used for aligning language models with human preferences in a supervised manner. However, several key questions remain unresolved: the rationale behind its log-ratio reward, how the statistical structure of preference datasets shapes its training dynamics, and how those dynamics impact downstream capabilities. We approach these questions from a Bayesian perspective, interpreting the goal of preference optimization as learning the differential information required to update a reference policy into a target policy. To formalize this view, we introduce the Differential Information Distribution (DID), defined as the distribution over samples that carry the Bayesian evidence required to update policies. We introduce three complementary insights by viewing preference optimization through the DID. First, we find that DPO's log-ratio reward…
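For readers unfamiliar with the log-ratio reward mentioned in the abstract, the following minimal sketch shows the standard DPO reward and pairwise loss for a single preference pair. The function names, the default `beta`, and the scalar log-probability inputs are illustrative assumptions; in practice these would be summed token log-probabilities from the policy and reference models.

```python
import math

def dpo_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Log-ratio reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w: float, logp_ref_w: float,
             logp_l: float, logp_ref_l: float,
             beta: float = 0.1) -> float:
    """Pairwise DPO loss, -log sigmoid(r_w - r_l), for one (chosen, rejected) pair."""
    margin = (dpo_reward(logp_w, logp_ref_w, beta)
              - dpo_reward(logp_l, logp_ref_l, beta))
    # -log sigmoid(m) = softplus(-m), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities for a chosen (w) and rejected (l) response.
loss_at_init = dpo_loss(-1.0, -1.0, -2.0, -2.0)    # policy equals reference
loss_improved = dpo_loss(-0.5, -1.0, -2.5, -2.0)   # chosen up, rejected down
print(loss_at_init, loss_improved)
```

When the policy equals the reference, both rewards are zero and the loss is log 2. Raising the chosen response's log-ratio relative to the rejected one lowers the loss.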
Peer Reviews
Decision · Submitted to ICLR 2026
Framework - The paper provides a thorough analysis under the DID framework and shows conditions under which the DPO reward is optimal. The framework takes a Bayesian perspective and brings a new approach to analyzing the behavior of DPO. The analysis leads to potential new insights on the effects of likelihood displacement and the structure of preference data.
Justification - The framework, while interesting and novel, lacks justification for key assumptions and definitions. For example, in Definition 2.2, it is unclear why conditional independence and a Bayesian update should model preference data and learning well. The framework relies on this definition, so it is important that justification and empirical support be provided. Furthermore, in Theorem 4.1, it is assumed that either preferred or dispreferred responses are sampled from the reference mod
+ The paper introduces the Differential Information Distribution (DID), providing a deep understanding of how Bayesian evidence drives the updating of policies in DPO.
+ The paper demonstrates that the reward parameterization, training dynamics, and learned capabilities in DPO emerge naturally by analyzing DID.
+ By analyzing the Shannon entropy of the DID, this paper demonstrates how DID entropy influences the trade-off between factual accuracy and open-ended task performance via a real LLM e
+ The controlled Energy-Based Model experiments and empirical tests (Figure 1, Figure 2, Figure 3) in Sections 3 and 4 are built around strong assumptions of matched data-generating processes and synthetic settings where DIDs align almost perfectly. While some results (Table 1) use real-world LLMs and datasets, they are limited. It is uncertain whether the findings from synthetic setups apply to LLMs on real tasks.
+ There is the assumption that $\pi_{w} = \pi_{\text{ref}}$ in Section 4, along wi
1. Establishes a Bayesian formulation of preference optimization, linking DPO to information-theoretic evidence accumulation.
2. Explains DPO’s reward structure, dynamics, and task-dependent behaviors (open-ended vs factual) under one consistent lens.
3. Provides closed-form derivations (e.g., Likelihood Ratio Representation, Entropy of DID) that connect policy updates to Bayesian ratios.
4. Offers a principled way to reason about why different `β` or entropy configurations produce distinc
1. The conditional independence of `X` from the prior and the power-law DID assumption are not empirically testable or demonstrated to hold in real preference data.
2. DID “existence” is defined by construction, not derived, which weakens claims of theoretical generality.
3. DID entropy estimation uses a small sample (`K=32`) with potentially large variance; no confidence intervals or significance testing are reported.
4. Theorem 3.2’s “unique justification” of the log-ratio reward is cont
Taxonomy
Topics: Consumer Market Behavior and Pricing
Methods: Direct Preference Optimization
