BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou

TL;DR
BOND is a novel RLHF algorithm that emulates Best-of-N sampling by matching the policy's distribution to the Best-of-N distribution, improving summarization quality and model performance without the inference cost of sampling N candidates.
Contribution
BOND introduces a distribution matching approach using Jeffreys divergence to emulate Best-of-N sampling efficiently in RLHF.
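To make the divergence concrete, here is a minimal sketch of a Jeffreys-style objective on small discrete distributions. It assumes the "linear combination of forward and backward KL" is a convex combination with a weight `beta`; that weighting, the function names, and the toy distributions are illustrative assumptions, not the paper's implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(policy, target, beta=0.5):
    """Weighted mix of forward and backward KL (the weighting is an assumption)."""
    forward = kl(target, policy)   # forward KL: mode-covering
    backward = kl(policy, target)  # backward KL: mode-seeking
    return (1 - beta) * forward + beta * backward

# Toy example: a policy and a target that puts more mass on the last outcome.
policy = [0.5, 0.3, 0.2]
target = [0.25, 0.39, 0.36]
print(jeffreys(policy, target, beta=0.5))
```

With `beta=0` the sketch reduces to the forward KL alone, and with `beta=1` to the backward KL alone, which is one way to see how a single weight trades off the two behaviors.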
Findings
Outperforms other RLHF methods on benchmarks
Improves summarization quality in experiments
Efficiently aligns model distributions with Best-of-N
Abstract
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several…
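As an illustration of the Best-of-N strategy the abstract describes, the sketch below draws N candidates and keeps the highest-reward one, then computes the exact Best-of-N distribution for a toy finite outcome set. The function names, the toy policy, and the assumption of distinct rewards are hypothetical simplifications for illustration, not the paper's code.

```python
import random

def best_of_n(sample, reward, n, rng):
    """Draw n candidates from `sample` and return the one with the highest reward."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

def best_of_n_distribution(probs, rewards, n):
    """Exact Best-of-N distribution over a finite set with distinct rewards.

    An outcome is selected when the maximum reward of n i.i.d. draws equals
    its reward, so its probability is F(y)^n - F_below(y)^n, where F is the
    cumulative probability of outcomes with reward at most that of y.
    """
    order = sorted(range(len(probs)), key=lambda i: rewards[i])
    out = [0.0] * len(probs)
    below = 0.0
    for i in order:
        upto = below + probs[i]
        out[i] = upto ** n - below ** n
        below = upto
    return out

# Toy policy over three outcomes whose reward is simply their index.
probs, rewards = [0.5, 0.3, 0.2], [0, 1, 2]
print(best_of_n_distribution(probs, rewards, n=2))

rng = random.Random(0)
draw = lambda r: r.choices([0, 1, 2], weights=probs)[0]
print(best_of_n(draw, reward=lambda y: rewards[y], n=4, rng=rng))
```

Each call to `best_of_n` costs N generations plus N reward evaluations, which is the inference-time overhead that distilling the Best-of-N distribution into a single-sample policy is meant to avoid.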
Peer Reviews
Decision: ICLR 2025 Poster
The paper makes multiple contributions, namely a theoretical derivation of the Best-of-N distribution and a practical RLHF fine-tuning algorithm that distills the Best-of-N distribution into a policy that is sample-efficient and requires just a single sample at inference time. The authors make many engineering design choices in their proposed model and carefully analyze the role of each component in the performance of the proposed algorithm. To regularize the model and ensure it is n
The paper combines a lot of distinct ideas already proposed in previous works; it would be good to clearly articulate what the novel contribution is. Besides, the comparison with concurrent works is not very clear, in particular the difference with (Amini et al., 2024), WARM, and WARP (Rame et al., 2024). Figure 4: it would be interesting to see how the performance of Best-of-N compares to the proposed algorithm J-BOND and REINFORCE. Algorithm 1, line 330: \pi_t is not defined. Line 329
The paper is well-structured and clearly presents its methodology, with detailed explanations and algorithms that allow readers to follow the progression. From iterative BOND to the addition of KL regularization in Sections 4 and 5, the additional experimental results effectively support these methodological advancements. BOND is notable for its originality, offering a practical and computationally efficient alternative to traditional RLHF that achieves a superior KL-reward balance without requ
The paper relies heavily on the Jeffreys divergence without sufficient comparative analysis against alternative divergence metrics. The mode-covering and mode-seeking behaviors the paper mentions are only observed in low dimensions, such as a multimodal distribution in one dimension. Including other divergence types, especially in the iterative stages, could offer clearer insights into the unique advantages of the Jeffreys divergence. Further, relevant literature on divergence measures
1. Rigorous Theoretical Analysis: This work rigorously analyzes the distribution characteristics under Best-of-N sampling and establishes its connection with standard RLHF, as well as the specific reward value $r_{BOND}$ under this correlation. This provides a reliable theoretical foundation for the work, rather than being based on naive assumptions. 2. Some Degree of Novelty: Although there is some concurrent work, the idea of distilling distributions from Best-of-N is fairly novel and importa
1. Lack of Important Baselines: Given that the main purpose of the paper is to distill Best-of-N sampling, BoN performance should straightforwardly serve as an important baseline to analyze pros and cons in terms of performance and efficiency. Moreover, other concurrent BoN distillation algorithms [1] should also be considered. 2. Lack of Downstream Validation: The main metrics in the paper, such as reward value and KL divergence, cannot be directly equated to the model's performance on downstr
Taxonomy
Topics: Advanced Control Systems Optimization
