Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis
Gholamali Aminian, Idan Shenfeld, Amir R. Asadi, Ahmad Beirami, Youssef Mroueh

TL;DR
This paper analyzes Best-of-N methods for generative model alignment, introducing a smooth variant called SBoN, and provides theoretical bounds on KL divergence and regret to improve understanding of their performance.
Contribution
It develops a theoretical framework for SBoN, deriving bounds on KL divergence and regret, and demonstrates how smoothing mitigates reward overoptimization.
Findings
SBoN reduces reward overoptimization compared to BoN.
Theoretical bounds relate sample size to divergence and regret.
Empirical results confirm smoothing benefits in low-quality proxy reward scenarios.
Abstract
A simple yet effective method for inference-time alignment of generative models is Best-of-N (BoN), where N outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that BoN is almost optimal in reward vs KL tradeoffs, the effectiveness of BoN depends critically on the quality of the proxy reward model used for selection. For this purpose, we study BoN through a smooth version known as Soft Best-of-N (SBoN) and develop a theoretical framework to address this gap. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insights into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true reward under the optimal policy and the SBoN policy. Our…
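The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes SBoN picks one of the N sampled candidates with probability proportional to exp(beta * proxy reward), where `beta` is treated here as an inverse temperature (the paper's parametrization may differ). The names `sbon_weights`, `soft_best_of_n`, `best_of_n`, and `proxy_reward` are hypothetical.

```python
import math
import random


def sbon_weights(rewards, beta):
    """Softmax selection weights over candidate rewards.

    Assumption: weights are proportional to exp(beta * reward), with beta
    acting as an inverse temperature. beta = 0 gives a uniform pick (the
    reference policy); large beta approaches hard Best-of-N.
    """
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp(beta * (r - m)) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def best_of_n(candidates, proxy_reward):
    """Hard BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)


def soft_best_of_n(candidates, proxy_reward, beta, rng=None):
    """Soft BoN sketch: sample one candidate according to sbon_weights."""
    rng = rng or random.Random()
    rewards = [proxy_reward(c) for c in candidates]
    weights = sbon_weights(rewards, beta)
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy stand-ins: candidates are strings, the proxy reward is their length.
    samples = ["a", "bb", "ccc"]
    print(best_of_n(samples, len))
    print(soft_best_of_n(samples, len, beta=1.0, rng=random.Random(0)))
```

Under this convention, a smaller `beta` spreads the selection probability over more candidates, so a single candidate with an inflated proxy score is less likely to be chosen than under hard BoN.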
Peer Reviews
Decision·ICLR 2026 Poster
- Poor calibration of reward models is a widely known problem in the community. It strongly affects BoN (test-time scaling) results, because reward models are sensitive to OOD data and a single poorly rated trajectory can influence the final BoN result. The proposed SBoN seems to show better performance.
- The experiments are limited; I suggest verifying the method's effectiveness with more experiments on recent non-verifiable agentic tasks.
- The paper is self-contained; even though I am not an expert in this particular domain, I could follow the main results, which are quite technical.
- The small experiment corroborates the theoretical findings.
- Sometimes it's not entirely clear what the key novel results of this paper are compared to other papers (e.g. Huang 2025); it would be good to highlight this more.
- From a rough reading of Huang 2025, it seems they also aim to solve this exact problem. How does your proposed algorithm compare to theirs? Note that I am not asking for an experiment or anything; a theoretical interpretation would also be fine for me (both would be even stronger).
- This paper has no Discussion section which I
* Provides finite-sample bounds on the KL divergence between SBoN and the reference policy under the true reward model, as well as between SBoN policies under the true and proxy reward models. These results generalize across different values of $\beta$ (the temperature parameter for the Shannon entropy penalty used to smooth probabilities).
* Considers a calibrated proxy-reward model, which is more realistic and relevant for practical applications.
* I carefully read the paper, including the a
* The paper has overly complex notation, which significantly reduces readability and makes it difficult to follow the theoretical flow. Some notation should be refined or made consistent for clarity.
* The main theoretical results are difficult to interpret (see the Questions section for details).
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques
