Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

Wenhua Nie; Jianan Wu; Junlin Liu; Ziwei Li; Zheng Lin; Zhang Zijian; Yilong Fan; Haoran Zheng; Jyh-Shing Roger Jang

arXiv:2605.07689 · cs.LG · May 11, 2026

TL;DR

This paper identifies gradient starvation as a failure mode of group-mean-centered advantages in binary-reward reinforcement learning and proposes a simple fix, the fixed-reference Sign advantage, which reaches 73.8% GSM8K test accuracy versus 28.4% for standard normalized group-mean DrGRPO.

Contribution

The authors analyze how the group-mean advantage fails under binary rewards and introduce a fixed-reference Sign advantage, $A = 2r - 1$, that mitigates gradient starvation.
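The contrast between the two advantages can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the group-mean version below omits the standard-deviation normalization that the normalized baseline would apply. The Sign formula $A = 2r - 1$ is the one stated in the abstract.

```python
import numpy as np

def group_mean_advantage(rewards):
    """Centered advantage: each reward minus the mean of its group."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def sign_advantage(rewards):
    """Fixed-reference Sign advantage A = 2r - 1: +1 if correct, -1 if wrong."""
    r = np.asarray(rewards, dtype=float)
    return 2.0 * r - 1.0

# Three illustrative groups of binary rewards at group size four.
groups = {
    "all wrong": [0, 0, 0, 0],
    "mixed": [1, 0, 0, 1],
    "all correct": [1, 1, 1, 1],
}
for name, rewards in groups.items():
    print(name, group_mean_advantage(rewards), sign_advantage(rewards))
```

In the two uniform groups the centered advantage is zero for every response, so those groups contribute nothing to the update, while the Sign advantage stays at -1 or +1 regardless of what the rest of the group did.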

Findings

01

Sign advantage reaches 73.8% accuracy on the full GSM8K test set across seven seeds, versus 28.4% for standard normalized group-mean DrGRPO.

02

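In logged Qwen3.5-9B GSM8K training, the observed degeneracy rate (the fraction of groups in which every response is correct or every response is wrong) is 0.69 at group size four.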

03

The proposed fix improves search efficiency rather than capacity.
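To make the degeneracy rate in finding 02 concrete, the sketch below compares the i.i.d. Bernoulli prediction, $p^G + (1-p)^G$, with a simulation in which per-prompt success probabilities differ. The per-prompt probabilities (0.05 and 0.95) are hypothetical values chosen for illustration, not data from the paper, and the helper names are assumptions.

```python
import numpy as np

def iid_degeneracy(p, G):
    """P(all G responses agree) if each is correct independently with probability p."""
    p = np.asarray(p, dtype=float)
    return p**G + (1.0 - p)**G

def simulated_degeneracy(p_per_prompt, G, rng):
    """Fraction of sampled groups whose binary rewards are all 0 or all 1."""
    p = np.asarray(p_per_prompt, dtype=float)
    rewards = rng.random((p.size, G)) < p[:, None]  # one row of G rewards per prompt
    n_correct = rewards.sum(axis=1)
    return float(np.mean((n_correct == 0) | (n_correct == G)))

G = 4
rng = np.random.default_rng(0)

# Hypothetical prompt mix (not from the paper): half very hard, half very easy.
p_prompts = rng.choice([0.05, 0.95], size=20000)

pooled = float(iid_degeneracy(p_prompts.mean(), G))  # prediction from pooled accuracy
exact = float(iid_degeneracy(p_prompts, G).mean())   # average over per-prompt p
sim = simulated_degeneracy(p_prompts, G, rng)

print(f"pooled i.i.d. prediction: {pooled:.3f}")
print(f"per-prompt expectation:   {exact:.3f}")
print(f"simulated rate:           {sim:.3f}")
```

With a pooled accuracy near 0.5 the i.i.d. formula predicts a degeneracy rate near 0.125 at $G = 4$, yet the heterogeneous mix produces a much higher rate, because most prompts are individually either almost always solved or almost never solved.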

Abstract

Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every response is wrong, the centered advantage is exactly zero and the policy receives no learning signal. We prove that the true degeneracy rate always exceeds the i.i.d. Bernoulli prediction by Jensen's inequality, and observe a 0.69 degeneracy rate at group size four in logged Qwen3.5-9B GSM8K training. We then show that the fixed-reference Sign advantage, $A = 2r - 1$, performs pass@$G$ failure descent by increasing the probability that at least one sample in the group succeeds. On the full GSM8K test set across seven seeds, Sign reaches 73.8% accuracy versus 28.4% for standard normalized group-mean DrGRPO at group…
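The two analytical claims in the abstract can be read as follows. This is a reconstruction under simplifying assumptions (a single prompt $x$ with success probability $p_x$, a sequence-level score-function gradient, no clipping or normalization), not the paper's own derivation; the symbols $D_G$, $p_x$, and $\pi_\theta$ are introduced here for illustration.

For the degeneracy rate, the i.i.d. Bernoulli prediction is convex in the success probability when $G \ge 2$, so Jensen's inequality applies when $p_x$ varies across prompts:

$$
D_G(p) = p^G + (1-p)^G, \qquad D_G''(p) = G(G-1)\left[p^{G-2} + (1-p)^{G-2}\right] > 0 \quad (G \ge 2),
$$

$$
\mathbb{E}_x\!\left[D_G(p_x)\right] \ge D_G\!\left(\mathbb{E}_x[p_x]\right),
$$

with equality only when every prompt shares the same $p_x$.

For the Sign advantage, the score function has zero mean, so the constant $-1$ drops out in expectation and the expected update points along $\nabla_\theta p$, which is a descent direction for the pass@$G$ failure probability $(1-p)^G$ whenever $p < 1$:

$$
\mathbb{E}_{y \sim \pi_\theta}\!\left[(2r(y) - 1)\,\nabla_\theta \log \pi_\theta(y)\right] = 2\,\nabla_\theta p, \qquad \nabla_\theta (1-p)^G = -G(1-p)^{G-1}\,\nabla_\theta p .
$$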
