Theoretical guarantees on the best-of-n alignment policy
Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh

TL;DR
This paper critically examines the theoretical properties of best-of-$n$ sampling in generative models, correcting a common misconception about KL divergence bounds, and introduces a new estimator to better understand the tradeoffs involved.
Contribution
It disproves a widely cited analytical expression for KL divergence in best-of-$n$ sampling, proposes a new estimator, and analyzes the tradeoffs between win rate and divergence.
Findings
The claimed KL divergence formula is an upper bound, not an exact value.
A new KL divergence estimator provides a tighter approximation.
Good tradeoffs between win rate and KL divergence are achievable with fewer than 1000 samples.
Abstract
A simple and effective method for the inference-time alignment and scaling test-time compute of generative models is best-of-$n$ sampling, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log(n) - (n-1)/n$. We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this…
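The procedure the abstract describes can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the helper names (`best_of_n`, `kl_upper_bound`, `win_rate_upper_bound`, `empirical_win_rate`), the toy reference policy (uniform draws on [0, 1) with the draw itself as the reward), and the convention of counting ties as half a win are all assumptions made for the example. The two closed-form expressions are the ones the abstract refers to, $\log(n) - (n-1)/n$ for the KL upper bound and $n/(n+1)$ for the win-rate upper bound.

```python
import math
import random


def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n candidates from the reference policy and keep the highest-reward one."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


def kl_upper_bound(n):
    """Commonly cited expression log(n) - (n - 1)/n, in nats.

    Per the abstract, this is an upper bound on the true KL divergence,
    not its exact value.
    """
    return math.log(n) - (n - 1) / n


def win_rate_upper_bound(n):
    """Upper bound n / (n + 1) on the win rate of best-of-n vs. the reference."""
    return n / (n + 1)


def empirical_win_rate(sample_fn, reward_fn, n, trials, seed=0):
    """Monte Carlo win rate of best-of-n against a single reference sample.

    Assumption for this sketch: a tie counts as half a win.
    """
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(trials):
        best = reward_fn(best_of_n(sample_fn, reward_fn, n, rng))
        ref = reward_fn(sample_fn(rng))
        if best > ref:
            wins += 1.0
        elif best == ref:
            wins += 0.5
    return wins / trials


if __name__ == "__main__":
    # Toy reference policy (hypothetical): a uniform draw, scored by its own value.
    sample = lambda rng: rng.random()
    reward = lambda y: y
    for n in (2, 4, 16, 128):
        print(
            n,
            round(kl_upper_bound(n), 4),
            round(win_rate_upper_bound(n), 4),
            round(empirical_win_rate(sample, reward, n, trials=20000), 4),
        )
```

With a continuous toy policy like this one, ties essentially never occur, so the simulated win rate should sit close to $n/(n+1)$. A policy concentrated on a few outcomes would be expected to fall below that bound.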
Taxonomy
Topics: Sports Analytics and Performance · Explainable Artificial Intelligence (XAI) · Data Analysis with R
Methods: Balanced Selection
