Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Zhipeng Chen; Xiaobo Qin; Youbin Wu; Yue Ling; Qinghao Ye; Wayne Xin Zhao; Guang Shi

arXiv:2508.10751·cs.LG·August 15, 2025

3 Reviews

TL;DR

This paper introduces Pass@k Training, a novel approach that uses Pass@k as a reward in reinforcement learning for large language models, improving exploration and balancing it with exploitation.

Contribution

It presents the first use of Pass@k as a reward in RLVR, derives an analytical advantage function, and demonstrates how exploration and exploitation can mutually enhance each other.

Findings

01

Pass@k Training improves exploration ability of large models.

02

Analytical advantage derivation leads to efficient training.

03

Exploration and exploitation can mutually reinforce each other.

Abstract

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation: policies come to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although prior work has used Pass@k in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives, while they can mutually…
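To make the idea of "Pass@k as the reward" concrete, here is a minimal sketch, not the paper's implementation. It assumes binary rewards (1 = verified correct, 0 = incorrect) and treats the reward of a group of k responses as the maximum reward in the group. The function names `pass_at_k` and `bootstrap_pass_at_k` are hypothetical. The closed form is the standard combinatorial Pass@k estimate for n samples with c correct; the bootstrap version draws random groups of k from the same n samples and should agree with it on average. The paper's advantage derivation itself is not reproduced here.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Closed-form Pass@k: probability that a uniformly random group of k
    distinct responses, drawn from n responses of which c are correct,
    contains at least one correct response."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the all-incorrect groups; it is 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_pass_at_k(rewards, k, num_groups, seed=0):
    """Monte Carlo estimate of the same quantity: sample groups of k distinct
    responses and score each group by its best (max) binary reward."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_groups):
        group = rng.sample(rewards, k)  # k responses without replacement
        hits += max(group)              # group reward is 1 if any response is correct
    return hits / num_groups


if __name__ == "__main__":
    rewards = [1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical rollout: 2 of 8 correct
    print(pass_at_k(len(rewards), sum(rewards), 4))
    print(bootstrap_pass_at_k(rewards, 4, 20000))
```

The closed form needs only the counts n, c, and k, which illustrates why an analytical expression can replace repeated group sampling; the sampled estimate carries Monte Carlo noise that shrinks as `num_groups` grows.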

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Originality
- The analytical derivation of advantage estimates (Equations 8-9) is an original technical contribution, providing closed-form solutions that eliminate sampling variance and reduce computational overhead compared to full sampling.

Quality
- Experiments have been presented across diverse domains, including maze navigation with varying sizes, mathematical reasoning tasks, and synthetic puzzles.
- The paper presents diversity metrics and policy entropy measurements showing Pass@k T

Weaknesses

1. The connection between the advantage function design presented in Section 4 and optimality guarantees is not established.
2. All experiments use outcome-based verification with binary rewards, and applicability to process-based rewards, partial credit, or continuous reward signals remains unexplored.
3. The computational cost analysis is incomplete. Actual wall-clock time comparisons across all methods and scales are not provided. Depending on k, Pass@k may significantly increase the wall-cl

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper focuses on an important issue in reinforcement learning for large reasoning models—the exploration–exploitation trade-off under verifiable rewards. Its motivation is clear, and the proposed Pass@k Training framework provides a systematic three-stage implementation (full sampling → bootstrap sampling → analytical derivation). The presentation of ablations on different k values and learning rates demonstrates a good empirical sense of robustness. From a methodological perspective, the an

Weaknesses

1. The paper shows strong similarity to Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems (DeepMind, 2025) in terms of research motivation (limitations of Pass@1 exploration), core idea (training with Pass@k as the reward), key formulation (Eq. 11), and experimental validation (different tasks but similar evaluation goals and performance trends). However, the prior work is only briefly mentioned in the related-work section without any detailed analytical or empirical com

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- This method achieves significant performance improvements with a relatively lightweight modification to the RLVR training process. The gains are not only in performance but also in sample diversity, as measured by answer diversity and policy entropy.
- The final "Analytical Derivation" method is technically solid, supported by a clear theoretical derivation.
- The experiments are conducted across a wide range of domains, which demonstrates the generalizability and robustness of the method.

Weaknesses

- The novelty of this paper seems insufficient. The core motivation and the primary solution appear similar to existing research, namely Pass@K Policy Optimization. Moreover, the effective "annealing" strategy of starting with k>1 and decaying to k=1 is also explored in Pass@K Policy Optimization.
- The implementation adopts only GRPO as its training algorithm; the generalizability of the proposed method to other classic RLVR algorithms is not verified.
