Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder; Deep Karkhanis

arXiv:2505.15201 · cs.LG · December 16, 2025


Open Access · 4 Reviews

TL;DR

This paper introduces Pass-at-k Policy Optimization (PKPO), a reward transformation for RL that directly optimizes pass@k, a set-based success metric, for any k up to the number of samples n drawn per problem. This improves exploration and performance on harder problems.

Contribution

We derive low-variance unbiased estimators for pass@k and its gradient, enabling robust optimization for any k, and demonstrate improved exploration and performance in RL tasks.

Findings

01. PKPO effectively optimizes for target k in RL.

02. Higher k values enable solving more difficult problems.

03. Annealing k during training improves both pass@1 and pass@k.

Abstract

Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a…
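To make the pass@k objective concrete, the sketch below computes two textbook unbiased estimates from n sampled attempts: the standard binary pass@k estimate used in code-generation evaluation, and its continuous-reward analogue (the average, over all k-subsets, of the best reward in the subset). This illustrates the quantity being optimized, not the paper's low-variance gradient estimators or its reward transformation; the function names `pass_at_k` and `max_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("require 0 <= c <= n")
    # comb(n - c, k) / comb(n, k) is the fraction of k-subsets that
    # contain no correct sample; comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def max_at_k(rewards: list[float], k: int) -> float:
    """Mean over all k-subsets of the maximum reward in the subset."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # comb(i - 1, k - 1) subsets, so no subset enumeration is needed.
    total = sum(comb(i - 1, k - 1) * g for i, g in enumerate(ordered, start=1))
    return total / comb(n, k)


if __name__ == "__main__":
    # One correct attempt out of four: pass@1 is 0.25, pass@2 is 0.5.
    print(pass_at_k(n=4, c=1, k=1), pass_at_k(n=4, c=1, k=2))
    # With 0/1 rewards the continuous estimate matches the binary one.
    print(max_at_k([0.0, 0.0, 0.0, 1.0], k=2))
```

With k = 1 both functions reduce to the sample mean, which is the pass@1 setting the abstract describes. Larger k gives credit to a set of samples as soon as any one of them succeeds.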

Peer Reviews

Decision · NeurIPS 2025 spotlight

Reviewer 01 · Rating 5 · Confidence 4

Strengths

**Summary:** the theoretical contribution of this work is clearly strong; the main crux is whether the much weaker empirical section is sufficient to justify acceptance. In this reviewer's opinion, this paper does not meet the bar for the strength of empirical support in RL-for-LMs theoretical papers published at NeurIPS, hence Borderline Reject. However, I'd happily change my rating to Accept with the addition of several new experiments (see the Weaknesses section).

## Strengths

The paper is

Reviewer 02 · Rating 4 · Confidence 1

Strengths

This paper addresses an important problem in reinforcement learning: how to effectively optimize when multiple solution attempts per problem are sampled. The core idea of directly optimizing for pass@k is valuable. However, the proposed method suffers from several fundamental methodological and computational issues that undermine its theoretical claims and practical scalability.

**Strengths:**

* The core insight of directly optimizing pass@k rather than pass@1 is valuable, as it addresses a re

Reviewer 03 · Rating 4 · Confidence 2

Strengths

**Strengths:**

- The pass@k optimization paradigm is relevant and increasingly important in RL for generative models. This paper contributes to it by introducing a general formulation (arbitrary k≤n), supporting it with rigorous mathematical derivations, and providing empirical validation.
- The paper is well written. The mathematical content is clearly presented, and the proofs are systematically included, either in the main text or in the appendix.
- The experimental section includes a toy pro

Reviewer 04 · Rating 5 · Confidence 4

Strengths

Strengths:

1. The paper is clearly presented and easy to follow without extraneous content
2. The paper presents a timely and important contribution. The implicit misalignment between the pass@1 training and pass@k usage of models leaves much on the table. The paper's contribution, to provide a more general (and practical) reward transformation to support the direct pass@k, is, to my mind, a clearly valuable contribution with immediate application.

Weaknesses:

1. There is likely some missing li

Taxonomy

Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms · Transportation and Mobility Innovations