Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Andre He; Daniel Fried; Sean Welleck

arXiv:2506.02355·cs.LG·June 23, 2025

TL;DR

This paper identifies a rank bias in GRPO, a widely used reinforcement learning algorithm, that reinforces already-probable solutions and neglects rare ones. It introduces unlikeliness reward to up-weight rare correct solutions, which improves formal theorem proving performance.

Contribution

The paper reveals a rank bias in GRPO and proposes unlikeliness reward, which explicitly up-weights rare correct solutions so that training does more than sharpen the base model's distribution.

Findings

01. Unlikeliness reward mitigates rank bias and improves pass@N in theorem proving.

02. A revised GRPO training recipe achieves competitive results on miniF2F.

03. Hyperparameter tuning related to batch updates affects rank bias and model performance.
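Since the findings are stated in terms of pass@N, a small sketch of how that metric is commonly estimated may help. The function below uses the standard unbiased combinatorial estimator from code-generation evaluation. Whether the paper computes pass@N this way is an assumption, and the name `pass_at_n` and its parameters are illustrative only.

```python
from math import comb


def pass_at_n(num_samples: int, num_correct: int, n: int) -> float:
    """Estimate the probability that at least one of n samples is correct.

    num_samples: total samples drawn for one problem (hypothetical name)
    num_correct: how many of those samples the verifier accepted
    n:           the sampling budget being evaluated, with n <= num_samples
    """
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie in [0, num_samples]")
    if not 1 <= n <= num_samples:
        raise ValueError("n must lie in [1, num_samples]")
    # If fewer than n samples are incorrect, every size-n subset
    # must contain at least one correct sample.
    if num_samples - num_correct < n:
        return 1.0
    # 1 - P(all n chosen samples are incorrect)
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)
```

Distribution sharpening in this picture means the estimate rises for small n while staying flat or dropping for large n, which is why the findings report results as a function of the sampling budget.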

Abstract

Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly…
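The idea of up-weighting rare correct solutions can be illustrated with a toy sketch. The abstract is truncated here and does not give the reward formula, so the rank-based scaling below is a hypothetical form chosen only for illustration and is not the authors' definition. The names `unlikeliness_rewards`, `group_relative_advantages`, and `beta` are assumptions. The advantage computation follows the usual GRPO pattern of normalizing rewards within a group of samples for the same problem.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def unlikeliness_rewards(correct, logprobs, beta=0.25):
    """Scale each correct sample's reward down by its likelihood rank.

    Hypothetical illustrative form, not taken from the source:
    rank 0 is the most likely sample under the current policy, and
    the reward is correct * (1 - beta * (G - rank) / G), so the most
    likely correct sample is penalized the most.
    """
    group_size = len(correct)
    order = sorted(range(group_size), key=lambda i: logprobs[i], reverse=True)
    rank = [0] * group_size
    for position, index in enumerate(order):
        rank[index] = position
    return [
        float(c) * (1.0 - beta * (group_size - rank[i]) / group_size)
        for i, c in enumerate(correct)
    ]


# Toy group of four samples for one theorem: three verified proofs, one failure.
correct = [1, 1, 0, 1]
logprobs = [-1.0, -5.0, -2.0, -3.0]  # sequence log-probabilities (made up)

plain = group_relative_advantages([float(c) for c in correct])
shaped = group_relative_advantages(unlikeliness_rewards(correct, logprobs))
```

With plain binary rewards, all three correct samples receive the same advantage regardless of how likely they were. With the rank-scaled rewards, the least likely correct sample (index 1) receives the largest advantage, which is the qualitative behavior the abstract describes.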

Taxonomy

Topics: Public Procurement and Policy · Census and Population Estimation · Healthcare Policy and Management

Methods: Entropy Regularization · Proximal Policy Optimization