The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Farid Bagirov; Mikhail Arkhipov; Ksenia Sycheva; Evgeniy Glukhov; Egor Bogomolov

arXiv:2510.23393·cs.LG·October 28, 2025

TL;DR

This paper derives an unbiased on-policy gradient estimate for the max@k metric, a continuous generalization of pass@k, and extends the derivation to off-policy updates, so that reinforcement learning fine-tuning targets the performance actually obtained under Best-of-N sampling.

Contribution

The paper derives an unbiased on-policy gradient estimate for direct max@k optimization and extends the derivation to off-policy updates, a common element of modern RLVR algorithms that allows better sample efficiency.

Findings

01. Effective optimization of max@k in off-policy scenarios

02. Improved alignment with the Best-of-N sampling strategy

03. Enhanced sample efficiency in RLVR

Abstract

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms, which allow better sample efficiency. Empirically, we show that our objective effectively…
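
To make the metric in the abstract concrete, the sketch below estimates max@k from n rewards sampled for a single prompt. It uses the standard order-statistics estimator, i.e. the average of the maximum reward over all size-k subsets of the n samples, which is unbiased for the expected best-of-k reward and reduces to the familiar pass@k estimator when rewards are binary. This is an illustrative assumption: the abstract does not spell out the paper's estimator or its gradient, and the function names `max_at_k`, `max_at_k_bruteforce`, and `pass_at_k` are hypothetical.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Estimate max@k from n sampled rewards (illustrative, not the paper's code).

    Averages the maximum reward over all size-k subsets of the n samples,
    computed in closed form from the sorted rewards.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # C(i-1, k-1) subsets: itself plus k-1 of the i-1 smaller samples.
    weighted = sum(comb(i - 1, k - 1) * ordered[i - 1] for i in range(k, n + 1))
    return weighted / comb(n, k)


def max_at_k_bruteforce(rewards, k):
    """Reference implementation: enumerate every size-k subset explicitly."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


def pass_at_k(n, c, k):
    """Standard pass@k estimator for n samples of which c are correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    continuous = [0.1, 0.5, 0.9, 0.3]
    print(max_at_k(continuous, 2), max_at_k_bruteforce(continuous, 2))

    binary = [1, 0, 0, 1, 0]  # 2 correct generations out of 5
    print(max_at_k(binary, 2), pass_at_k(5, 2, 2))
```

With k = 1 the estimate is the mean reward, and with k = n it is the largest sampled reward, which is why the metric is described as interpolating between single-generation quality and Best-of-N quality.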
