BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
Lin Gui, Cristina Gârbacea, Victor Veitch

TL;DR
This paper analyzes best-of-$n$ sampling for aligning large language models to human preferences. It shows that best-of-$n$ is essentially optimal for maximizing win rate against the base model, and it proposes BoNBoN alignment, a fine-tuning method that efficiently mimics the best-of-$n$ distribution.
Contribution
The paper shows that best-of-$n$ sampling is essentially optimal for alignment when the goal is to maximize win rate, and it introduces BoNBoN alignment for efficiently imitating the best-of-$n$ distribution.
Findings
Best-of-$n$ is essentially optimal for maximizing win rate.
BoNBoN alignment improves preference alignment with minimal off-target effects.
Mimicking the best-of-$n$ distribution improves alignment with human preferences without significant performance loss.
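To make the win-rate side of this trade-off concrete, assume each base-model response's reward comes from a continuous distribution, so ties have probability zero. A best-of-$n$ sample then beats an independent base sample with probability $n/(n+1)$: among $n+1$ i.i.d. rewards, the largest is equally likely to be any of them. Under the same assumption, the best-of-$n$ density over the reward quantile $u$ is $n u^{n-1}$, which gives $\mathrm{KL}(\pi_{\text{bon}} \,\|\, \pi_0) = \log n - (n-1)/n$. The Python sketch below evaluates both formulas and checks the win rate by Monte Carlo. It uses uniform rewards, which is sufficient because only the ranking matters. The function names are hypothetical and not taken from the paper.

```python
import math
import random


def bon_win_rate(n: int) -> float:
    """Win rate of a best-of-n sample against one base sample, assuming no ties."""
    return n / (n + 1)


def bon_kl(n: int) -> float:
    """KL(best-of-n || base), assuming a continuous reward distribution."""
    return math.log(n) - (n - 1) / n


def simulated_win_rate(n: int, trials: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the best-of-n win rate.

    Rewards are drawn i.i.d. from Uniform(0, 1); any continuous reward
    distribution gives the same win rate because only the ranking matters.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best = max(rng.random() for _ in range(n))  # reward of the best-of-n sample
        base = rng.random()  # reward of an independent base-model sample
        wins += int(best > base)
    return wins / trials


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(
            f"n={n:2d}  win rate={bon_win_rate(n):.3f}  "
            f"simulated={simulated_win_rate(n):.3f}  KL={bon_kl(n):.3f}"
        )
```

The win rate approaches 1 as $n$ grows, while the KL cost grows only logarithmically in $n$. This sketch only tabulates the best-of-$n$ curve. It does not reproduce the paper's argument that this curve is essentially optimal within its class of tilted distributions.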
Abstract
This paper concerns the problem of aligning samples from large language models to human preferences using best-of-$n$ sampling, where we draw $n$ samples, rank them, and return the best one. We consider two fundamental problems. First: what is the relationship between best-of-$n$ and approaches to alignment that train LLMs to output samples with a high expected reward (e.g., RLHF or DPO)? To answer this, we embed both the best-of-$n$ distribution and the sampling distributions learned by alignment procedures in a common class of tiltings of the base LLM distribution. We then show that, within this class, best-of-$n$ is essentially optimal in terms of the trade-off between win-rate against the base model vs KL distance from the base model. That is, best-of-$n$ is the best choice of alignment distribution if the goal is to maximize win rate. However, best-of-$n$ requires drawing …
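The sampling procedure described in the abstract (draw $n$ samples, rank them, return the best one) can be sketched in Python as follows. `best_of_n`, the toy sampler, and the length-based reward are hypothetical stand-ins. A real setup would draw responses from the base LLM for a fixed prompt and score them with a learned reward model.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[], T], reward: Callable[[T], float], n: int) -> T:
    """Draw n samples, rank them by reward, and return the best one.

    `sample` stands in for drawing one response from the base LLM, and
    `reward` for a reward model that scores a response (both hypothetical).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[T] = [sample() for _ in range(n)]
    # max() returns the first candidate among those with equal reward.
    return max(candidates, key=reward)


if __name__ == "__main__":
    rng = random.Random(0)
    responses = [
        "No.",
        "Maybe later.",
        "Sure, here is a short answer.",
        "Sure, here is a detailed, step-by-step answer.",
    ]

    def toy_sample() -> str:
        """Toy base 'model': picks one canned response uniformly at random."""
        return rng.choice(responses)

    def toy_reward(response: str) -> float:
        """Toy reward (hypothetical): longer responses score higher."""
        return float(len(response))

    print(best_of_n(toy_sample, toy_reward, n=4))
```

Each call makes $n$ sampler calls and $n$ reward evaluations, so inference cost grows linearly with $n$. This cost is what makes fine-tuning a model to imitate the best-of-$n$ distribution directly attractive.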
Taxonomy
Topics: Topic Modeling
Methods: Balanced Selection
