Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda

TL;DR
This paper introduces a statistically consistent method for aligning language models with human preferences. It optimizes a relative density ratio rather than the plain density ratio used by DDRO, which improves training stability and convergence guarantees.
Contribution
The paper proposes a novel relative density ratio optimization method for model alignment that keeps the statistical consistency of existing density ratio approaches such as DDRO while improving training stability.
Findings
The new method trains more stably than DDRO and does not diverge during training.
It provides tighter convergence guarantees than previous methods.
Experimental results demonstrate effectiveness with Qwen 2.5 and Llama 3.
Abstract
Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment…
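The abstract is cut off before the proposed objective is stated, so the following is only a minimal sketch of why a relative density ratio might be easier to work with than a plain one. It assumes the standard α-relative density ratio from the density ratio estimation literature, r_α(x) = p(x) / (α p(x) + (1 − α) q(x)), which is bounded above by 1/α; whether the paper uses exactly this form is an assumption. The two Gaussians standing in for the preferred and non-preferred data distributions, and the names `gaussian_pdf`, `density_ratio`, `relative_density_ratio`, and `ALPHA`, are hypothetical and chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (toy stand-in for a data distribution)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def density_ratio(p, q):
    """Plain density ratio p/q; unbounded wherever q is close to zero."""
    return p / q


def relative_density_ratio(p, q, alpha):
    """Alpha-relative density ratio p / (alpha*p + (1-alpha)*q).

    For 0 < alpha <= 1 this never exceeds 1/alpha.
    """
    return p / (alpha * p + (1.0 - alpha) * q)


ALPHA = 0.5  # assumed mixing weight, not taken from the paper

rows = []
for x in [0.0, 2.0, 4.0, 6.0, 8.0]:
    p = gaussian_pdf(x, mean=2.0, std=1.0)  # toy "preferred" density
    q = gaussian_pdf(x, mean=0.0, std=1.0)  # toy "non-preferred" density
    rows.append((x, density_ratio(p, q), relative_density_ratio(p, q, ALPHA)))

for x, plain, relative in rows:
    print(f"x={x:4.1f}  plain={plain:12.4g}  relative={relative:.4f}")
```

In this toy setting the plain ratio grows exponentially as x moves into the region where the non-preferred density vanishes, while the relative ratio stays below 1/α. This only illustrates the boundedness property; it says nothing about the paper's actual training objective or results.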
