Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng

TL;DR
This paper unifies two approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping modifications of GRPO. It shows that both can be derived from surrogate reward maximization.
Contribution
It reveals that advantage-shaping techniques implicitly optimize surrogate rewards and offers a unified framework for deriving new and existing advantage-shaping methods.
Findings
Advantage-shaping methods can be interpreted as surrogate reward optimization.
A simple recipe derives both existing and new advantage-shaping algorithms from surrogate rewards.
The perspective extends beyond Pass@K to broader RLVR policy gradient optimization.
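As a rough illustration of the surrogate-reward view, consider a per-sample success probability rho and the surrogate F(rho) = 1 - (1 - rho)^K, the probability that at least one of K independent samples is correct. By the chain rule, the gradient of F(rho) is F'(rho) = K (1 - rho)^(K-1) times the gradient of rho, so a plain mean-centered advantage gets rescaled by a factor that grows as the prompt gets harder. The sketch below uses a plug-in estimate of rho from a group of binary rewards. The function names and the plug-in estimator are assumptions made for illustration; this is not the paper's exact algorithm.

```python
import numpy as np

def pass_at_k_weight(rho: float, k: int) -> float:
    """Derivative F'(rho) of the surrogate F(rho) = 1 - (1 - rho)**k."""
    return k * (1.0 - rho) ** (k - 1)

def shaped_advantages(rewards, k: int) -> np.ndarray:
    """Mean-centered advantages rescaled by F'(rho_hat).

    rewards: binary (0/1) verifier outcomes for one prompt's sampled responses.
    rho_hat is a plug-in estimate of the success rate (an assumption of this
    sketch, not necessarily the estimator used in the paper).
    """
    r = np.asarray(rewards, dtype=float)
    rho_hat = r.mean()
    return pass_at_k_weight(rho_hat, k) * (r - rho_hat)

# Hypothetical groups of four responses each.
hard = shaped_advantages([1, 0, 0, 0], k=4)  # low success rate, large weight
easy = shaped_advantages([1, 1, 1, 0], k=4)  # high success rate, small weight
```

With K = 1 the weight equals 1 and the result reduces to the usual mean-centered REINFORCE advantage; for larger K, prompts with a low success rate receive larger updates, which matches the "hard-example up-weighting" reading described in the abstract.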
Abstract
This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
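For readers unfamiliar with the two ingredients named in the abstract, the following minimal sketch shows a common way to estimate Pass@K from n sampled responses and the group-standardized advantage usually associated with GRPO. The combinatorial estimator is the standard one from code-generation evaluation and is assumed here rather than taken from the paper; the function names and the `eps` stabilizer are likewise hypothetical.

```python
from math import comb

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples of which c are correct.

    Returns 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def grpo_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Group-standardized advantages: (r - mean) / (std + eps).

    eps is an assumed stabilizer that avoids division by zero when
    every response in the group receives the same reward.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group: 4 responses, 1 verified correct.
estimate = pass_at_k(n=4, c=1, k=2)      # 1 - C(3, 2) / C(4, 2)
advantages = grpo_advantages([1, 0, 0, 0])
```

Advantage-shaping methods, as the abstract describes them, modify the second function's output, whereas direct REINFORCE-style methods differentiate an objective like the first.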
