RSPO: Risk-Seeking Policy Optimization for Pass@k and Max@k Metrics in Large Language Models

Kaichen Zhang; Shenghao Gao; Yuzhong Hong; Haipeng Sun; Junwei Bao; Hongfei Jiang; Yang Song; Hong Dingqian; Hui Xiong

arXiv:2508.01174 · cs.LG · August 5, 2025

Open Access

TL;DR

RSPO is a training method for large language models that directly optimizes risk-seeking metrics such as Pass@k and Max@k, addressing the mismatch between risk-neutral training objectives and risk-seeking evaluation metrics.

Contribution

The paper proposes RSPO, an approach that optimizes Pass@k and Max@k directly, overcoming the hitchhiking problem and using unbiased gradient estimators.

Findings

01. RSPO achieves superior performance on Pass@k and Max@k metrics.

02. The method provides unbiased gradient estimates despite complex nested gradients.

03. Theoretical analysis confirms the effectiveness of RSPO in large language models.

Abstract

Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences can lead to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the "hitchhiking" problem: low-reward responses are inadvertently reinforced if they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samples. Despite the complexity of nested gradients over…
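To make the two metrics concrete, here is a minimal Python sketch of how Pass@k and Max@k can be estimated from n sampled responses. It uses the standard combinatorial estimators for drawing k of n responses without replacement, which is an assumption for illustration and not necessarily the exact formulation used in the paper. The function names `pass_at_k`, `prob_is_max`, and `max_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that k of n responses (c of them correct), drawn without
    replacement, contain at least one correct response."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # comb(n - c, k) counts the k-subsets made up only of incorrect responses.
    return 1.0 - comb(n - c, k) / comb(n, k)


def prob_is_max(n: int, k: int) -> list[float]:
    """For responses ranked by ascending reward (rank 0 is lowest), the chance
    that each rank is the top-ranked member of a uniformly drawn k-subset."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    total = comb(n, k)
    # Rank i is on top when the other k - 1 members come from the i lower ranks.
    return [comb(i, k - 1) / total for i in range(n)]


def max_at_k(rewards: list[float], k: int) -> float:
    """Expected maximum reward over a uniformly drawn k-subset of the rewards."""
    ordered = sorted(rewards)
    weights = prob_is_max(len(ordered), k)
    return sum(w * r for w, r in zip(weights, ordered))


if __name__ == "__main__":
    print(pass_at_k(n=4, c=1, k=2))
    print(prob_is_max(n=4, k=2))
    print(max_at_k([0.0, 0.0, 0.0, 1.0], k=2))
```

With binary rewards the two estimators coincide, since the maximum of a subset is 1 exactly when the subset contains a correct response. The weights from `prob_is_max` also illustrate the hitchhiking concern in this simplified setting: the lowest-ranked responses have little or no chance of being the maximum of a k-subset, so rewarding them equally with the best response in the subset would credit them for an outcome they did not produce.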

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Multimodal Machine Learning Applications