Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening
Andre He, Daniel Fried, Sean Welleck

TL;DR
This paper identifies a rank bias in GRPO, the de facto standard reinforcement learning algorithm for language model reasoning: highly probable solutions are reinforced while rare ones are neglected. It introduces the unlikeliness reward, which promotes rare correct solutions and improves formal theorem proving performance.
Contribution
The paper reveals a rank bias in GRPO, in which highly probable trajectories are reinforced and rare ones are neglected, and proposes the unlikeliness reward to explicitly up-weight rare correct solutions, improving pass@$N$ in formal theorem proving.
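To make the idea concrete, here is a minimal Python sketch of a group-relative advantage computation together with one possible rank-based reward that favors rare correct samples. This summary does not state the exact formula, so the reward shape, the function names, and the `beta_rank` parameter are illustrative assumptions rather than the authors' implementation.

```python
import math

def group_relative_advantages(rewards):
    """Standardize rewards within one group of samples for the same prompt.

    Assumption: population standard deviation; a group with identical
    rewards gets zero advantage for every sample.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def unlikeliness_rewards(correct, logprobs, beta_rank=0.25):
    """Hypothetical rank-based reward: shrink the reward of correct samples
    that the current policy already finds likely.

    correct:  list of 0/1 verifier outcomes, one per sample
    logprobs: sequence log-probabilities of the samples under the policy
    """
    group_size = len(correct)
    # Rank 0 is the most likely sample, rank group_size - 1 the least likely.
    order = sorted(range(group_size), key=lambda i: logprobs[i], reverse=True)
    rank = [0] * group_size
    for position, index in enumerate(order):
        rank[index] = position
    return [
        float(correct[i]) * (1.0 - beta_rank * (group_size - rank[i]) / group_size)
        for i in range(group_size)
    ]

# Four samples for one theorem: three verified proofs, one failure.
correct = [1, 1, 0, 1]
logprobs = [-2.0, -10.0, -5.0, -6.0]

plain = group_relative_advantages([float(c) for c in correct])
shaped = group_relative_advantages(unlikeliness_rewards(correct, logprobs))
```

With plain binary rewards, all three correct samples receive the same advantage regardless of how likely they were. Under the shaped reward in this sketch, the least likely correct sample (log-probability -10.0) receives the largest advantage, which is the kind of up-weighting the contribution describes.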
Findings
Unlikeliness reward mitigates rank bias and improves pass@$N$ in theorem proving.
A revised GRPO training recipe achieves competitive results on miniF2F.
The number of updates per batch, a seemingly mundane hyperparameter, affects both rank bias and model performance.
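For readers unfamiliar with the pass@$N$ metric used in these findings, the sketch below shows the combinatorial estimator commonly used in code-generation evaluation: the probability that at least one of $N$ attempts, drawn without replacement from a larger pool of samples, is correct. Whether the paper computes pass@$N$ this way is an assumption; the function and argument names are hypothetical.

```python
from math import comb

def pass_at_n(num_samples, num_correct, n):
    """Estimate pass@n from num_samples attempts, num_correct of which passed.

    Computes 1 - C(num_samples - num_correct, n) / C(num_samples, n), i.e. one
    minus the chance that n attempts drawn without replacement all fail.
    """
    if not 0 < n <= num_samples:
        raise ValueError("n must be between 1 and num_samples")
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if num_samples - num_correct < n:
        # Fewer failures than draws, so at least one draw must succeed.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# A problem solved by 1 of 4 samples: low pass@1, but much higher pass@2.
low_budget = pass_at_n(4, 1, 1)
higher_budget = pass_at_n(4, 1, 2)
```

The example illustrates why rare correct solutions matter at large sample budgets: a problem solved only occasionally contributes little to pass@1 but substantially to pass@$N$ as $N$ grows.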
Abstract
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly…
Taxonomy
Topics: Public Procurement and Policy · Census and Population Estimation · Healthcare Policy and Management
Methods: Entropy Regularization · Proximal Policy Optimization
