Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

Christos Thrampoulidis; Sadegh Mahdavi; Wenlong Deng

arXiv:2510.23049·cs.LG·March 24, 2026

TL;DR

This paper unifies two approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping modifications of GRPO. It shows that both can be derived from surrogate reward maximization.

Contribution

The paper shows that existing advantage-shaping techniques implicitly optimize surrogate rewards, and it offers a unified framework for deriving both existing and new advantage-shaping methods.

Findings

1. Advantage-shaping methods can be interpreted as surrogate reward optimization.

2. Starting from a surrogate reward objective, a simple recipe yields both existing and new advantage-shaping algorithms.

3. The perspective extends beyond Pass@K to broader RLVR policy gradient optimization.

Abstract

This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
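To make the objects named in the abstract concrete, here is a minimal Python sketch of three standard quantities: the widely used unbiased Pass@K estimate from n samples with c correct, GRPO-style group-normalized advantages, and the chain-rule factor K(1-p)^(K-1) that appears when differentiating the population objective 1-(1-p)^K with respect to the per-prompt success rate p. The function names and the `eps` stabilizer are illustrative assumptions, and none of this is taken from the paper's own algorithms; it only suggests why Pass@K gradients place more weight on hard prompts (small p).

```python
import math
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples of which c are correct:
    1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).
    eps is an assumed stabilizer for groups with identical rewards."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k_weight(p: float, k: int) -> float:
    """d/dp [1 - (1 - p)^k] = k * (1 - p)^(k - 1).
    Larger for small p, i.e. hard prompts get up-weighted."""
    return k * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    print(pass_at_k(n=4, c=1, k=2))
    print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
    print(pass_at_k_weight(0.1, 4), pass_at_k_weight(0.9, 4))
```

With K = 1 the weight is identically 1, so the sketch reduces to the ordinary Pass@1 case; for larger K the weight decays as the success rate grows, which is one way to read the "hard-example up-weighting" mentioned above.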
