Theoretical Limits of Language Model Alignment
Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli

TL;DR
This paper explores the fundamental theoretical limits of language model alignment under KL constraints, providing formulas, practical estimators, and empirical analysis to understand reward improvements and the effectiveness of alignment techniques.
Contribution
It derives the maximum achievable reward gain under a KL budget, introduces a covariance-based estimator, and analyzes reward hacking and ensembling, advancing the theoretical understanding of LM alignment.
Findings
Best-of-N alignment comes close to the theoretical limit.
PPO and GRPO are significantly suboptimal compared to the theoretical maximum.
Reward ensembling mitigates reward hacking, improving alignment robustness.
Abstract
Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-N alignment, which selects the highest-reward output among N independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the term used in prior analyses. We further reformulate this expression as a…
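The two alignment approaches named in the abstract can be compared on a toy problem. The sketch below is an illustration only, not the paper's method: it assumes a small discrete output space with distinct rewards, and all names (`tilted_policy`, `best_of_n_policy`, `beta`) are hypothetical. It uses two standard facts: the KL-regularized objective is maximized by the exponentially tilted policy pi*(y) ∝ pi(y) exp(r(y)/beta), whose reward gain equals beta times the Jeffreys divergence (the sum of the two KL directions) between pi* and pi; and the KL divergence of best-of-N from the base policy is at most log N - (N-1)/N. Whether this identity matches the paper's closed-form expression is not confirmed by the truncated abstract.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def tilted_policy(base, reward, beta):
    """Exponentially tilted policy: pi*(y) proportional to base(y) * exp(reward(y) / beta)."""
    logits = np.log(base) + reward / beta
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def best_of_n_policy(base, reward, n):
    """Exact best-of-n output distribution, assuming all rewards are distinct."""
    order = np.argsort(reward)                 # outputs sorted by increasing reward
    cdf = np.cumsum(base[order])               # P(reward <= r_i) under the base policy
    cdf_prev = np.concatenate(([0.0], cdf[:-1]))
    probs = np.empty_like(base)
    probs[order] = cdf**n - cdf_prev**n        # P(max of n draws is output i)
    return probs

# Toy base policy and reward (assumed values, for illustration only).
rng = np.random.default_rng(0)
base = rng.dirichlet(np.ones(8))
reward = rng.normal(size=8)

# KL-regularized optimum and its reward gain.
beta = 0.5
pi_star = tilted_policy(base, reward, beta)
gain = float(pi_star @ reward - base @ reward)
jeffreys = kl(pi_star, base) + kl(base, pi_star)

# Best-of-n policy, its reward gain, and its KL from the base policy.
n = 4
pi_bon = best_of_n_policy(base, reward, n)
bon_gain = float(pi_bon @ reward - base @ reward)
bon_kl = kl(pi_bon, base)
bon_kl_bound = float(np.log(n) - (n - 1) / n)

print("tilted policy: gain =", gain, " beta * Jeffreys =", beta * jeffreys)
print("best-of-n: gain =", bon_gain, " KL =", bon_kl, " bound =", bon_kl_bound)
```

Sweeping `beta` and `n` in this toy setting gives reward-versus-KL curves for both policies, which is one way to see how close best-of-N comes to the KL-regularized optimum at a matched KL budget.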
