Theoretical guarantees on the best-of-n alignment policy

Ahmad Beirami; Alekh Agarwal; Jonathan Berant; Alexander D'Amour; Jacob Eisenstein; Chirag Nagpal; Ananda Theertha Suresh

arXiv:2401.01879·cs.LG·May 30, 2025·1 cites

Open Access

TL;DR

This paper examines the theoretical properties of best-of-$n$ sampling for generative models. It shows that a commonly used expression for the KL divergence between the best-of-$n$ policy and the reference policy is only an upper bound, and it proposes a new KL divergence estimator to better characterize the tradeoff between win rate and divergence.

Contribution

It disproves a widely cited analytical expression for KL divergence in best-of-$n$ sampling, proposes a new estimator, and analyzes the tradeoffs between win rate and divergence.

Findings

01

The claimed KL divergence formula is an upper bound, not an exact value.

02

A new KL divergence estimator provides a tighter approximation.

03

Good tradeoffs between win rate and divergence are achievable with fewer than 1000 samples.
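
To make the tradeoff in the findings concrete, here is a minimal sketch that tabulates the two upper bounds quoted in the abstract: $\log(n) - (n-1)/n$ for the KL divergence and $n/(n+1)$ for the win rate. The function names are hypothetical, and the use of natural logarithms (nats) is an assumption. These are bounds only, not the paper's estimator or its exact values.

```python
import math


def kl_upper_bound(n: int) -> float:
    """Upper bound on KL(best-of-n || reference): log(n) - (n - 1) / n.

    Natural log is assumed here, so the result is in nats.
    """
    return math.log(n) - (n - 1) / n


def win_rate_upper_bound(n: int) -> float:
    """Upper bound on the win rate of best-of-n against the reference: n / (n + 1)."""
    return n / (n + 1)


if __name__ == "__main__":
    # Print both bounds side by side for a range of sample counts.
    print(f"{'n':>6} {'KL bound (nats)':>16} {'win-rate bound':>15}")
    for n in (1, 2, 4, 16, 64, 256, 1000):
        print(f"{n:>6} {kl_upper_bound(n):>16.4f} {win_rate_upper_bound(n):>15.4f}")
```

The KL bound grows only logarithmically in $n$ while the win-rate bound approaches 1, which is one way to read the claim that moderate values of $n$ can already give a good tradeoff.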

Abstract

A simple and effective method for the inference-time alignment and scaling test-time compute of generative models is best-of-$n$ sampling, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log(n) - (n - 1)/n$. We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n / (n + 1)$ and derive bounds on the tightness of this…
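
As a small numerical illustration of the abstract's claim, the sketch below computes the exact best-of-$n$ distribution for a toy discrete reference policy and compares its KL divergence with $\log(n) - (n-1)/n$. Several things here are assumptions rather than the paper's setup: the outcomes are listed in increasing order of reward with no ties between distinct outcomes, logarithms are natural, the win rate counts a tie as half a win, and all function names are hypothetical. This is not the estimator proposed in the paper.

```python
import math


def best_of_n_probs(ref_probs, n):
    """Exact best-of-n distribution over a finite outcome set.

    ref_probs[i] is the reference probability of outcome i; outcomes are
    assumed sorted by strictly increasing reward. The best of n draws equals
    outcome i when the maximum falls at i: F_i**n - F_{i-1}**n, with F the CDF.
    """
    probs, cdf_prev = [], 0.0
    for p in ref_probs:
        cdf = cdf_prev + p
        probs.append(cdf ** n - cdf_prev ** n)
        cdf_prev = cdf
    return probs


def kl_divergence(q, p):
    """KL(q || p) in nats; terms with q_i == 0 contribute zero."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def win_rate(bon_probs, ref_probs):
    """P(best-of-n beats a fresh reference draw), counting ties as half a win."""
    total, cdf_prev = 0.0, 0.0
    for q, p in zip(bon_probs, ref_probs):
        total += q * (cdf_prev + 0.5 * p)
        cdf_prev += p
    return total


if __name__ == "__main__":
    n = 2
    ref = [0.5, 0.5]  # toy reference policy with two equally likely outcomes
    bon = best_of_n_probs(ref, n)
    print("best-of-n probs:", bon)
    print("exact KL       :", kl_divergence(bon, ref))
    print("claimed formula:", math.log(n) - (n - 1) / n)
    print("win rate       :", win_rate(bon, ref), "vs bound", n / (n + 1))
```

In this toy case the exact KL divergence falls strictly below the formula, consistent with the formula being an upper bound rather than an equality. A reference policy concentrated on a single outcome is an even simpler check: best-of-$n$ leaves it unchanged, so the KL divergence is zero for every $n$.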

Taxonomy

Topics: Sports Analytics and Performance · Explainable Artificial Intelligence (XAI) · Data Analysis with R

Methods: Balanced Selection