Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
Rei Higuchi, Taiji Suzuki

TL;DR
This paper introduces Direct Density Ratio Optimization (DDRO), a statistically consistent method for aligning large language models with human preferences. It directly estimates the density ratio between preferred and unpreferred output distributions instead of assuming an explicit preference model.
Contribution
The paper proposes DDRO, an alignment technique that does not assume a specific preference model (such as Bradley-Terry) and is proven to converge to the true preferred distribution as the amount of data grows.
Findings
DDRO outperforms existing methods on many major benchmarks.
DDRO is statistically consistent regardless of preference structure.
The approach enables more reliable, data-driven LLM alignment.
Abstract
Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models such as the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data does not guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods on many major benchmarks. DDRO unlocks the…
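For orientation, the objects the abstract contrasts can be written out as follows. This is only a sketch: the notation (prompt x, responses y, reward r, preferred and unpreferred output distributions p^+ and p^-, ratio g, estimate \hat{p}_n from n samples) is assumed here for illustration and may differ from the paper's own, and the DDRO objective itself is not given in this summary, so it is not reproduced.

```latex
% Bradley-Terry preference model (standard form), the assumption DDRO avoids:
P(y_1 \succ y_2 \mid x) = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Density ratio between preferred and unpreferred output distributions,
% the quantity DDRO is described as estimating directly (symbols assumed):
g(x, y) = \frac{p^{+}(y \mid x)}{p^{-}(y \mid x)}

% Statistical consistency, informally: the learned distribution approaches
% the true preferred distribution as the sample size n grows
% (the precise mode of convergence is not stated in this summary):
\hat{p}_n \to p^{+} \quad \text{as } n \to \infty
```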
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling · Gaussian Processes and Bayesian Inference
