Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, Chengchun Shi

TL;DR
This paper shows that the policy gradient of Group Relative Policy Optimization (GRPO) is a U-statistic, and uses that fact to characterize its error bounds, asymptotic behavior, and optimal group size, supported by empirical validation.
Contribution
It offers a unified U-statistics framework for understanding GRPO, characterizes its statistical properties, and derives a universal scaling law for group size selection.
Findings
GRPO's policy gradient is a U-statistic.
GRPO is asymptotically equivalent to an oracle policy gradient algorithm, one with access to a value function for its learning policy at each training iteration.
The optimal group size follows a universal scaling law.
Abstract
Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE), derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm -- one with access to a value function that quantifies the goodness of its learning policy at each training iteration -- and…
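To make the U-statistic claim concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: a single prompt, a mean-reward baseline only (no standard-deviation normalization, clipping, or KL term), and randomly generated stand-ins for the rewards and score vectors. All function names are hypothetical. Under those assumptions, the group-relative gradient estimate (1/G) Σ_i (r_i − r̄) s_i equals (G − 1)/G times an order-2 U-statistic with the symmetric kernel h = ½ (r_i − r_j)(s_i − s_j). The paper's own construction may differ.

```python
import itertools
import random


def group_relative_gradient(rewards, scores):
    """Mean-baseline estimator: (1/G) * sum_i (r_i - mean(r)) * s_i."""
    g = len(rewards)
    baseline = sum(rewards) / g
    dim = len(scores[0])
    return [
        sum((r - baseline) * s[k] for r, s in zip(rewards, scores)) / g
        for k in range(dim)
    ]


def u_statistic_gradient(rewards, scores):
    """Order-2 U-statistic with kernel h = 0.5 * (r_i - r_j) * (s_i - s_j),
    averaged over all unordered pairs i < j in the group."""
    g = len(rewards)
    dim = len(scores[0])
    pairs = list(itertools.combinations(range(g), 2))
    return [
        sum(
            0.5 * (rewards[i] - rewards[j]) * (scores[i][k] - scores[j][k])
            for i, j in pairs
        ) / len(pairs)
        for k in range(dim)
    ]


def max_abs_gap(group_size=8, dim=3, seed=0):
    """Largest coordinate-wise gap between the direct estimator and
    (G - 1) / G times the U-statistic, on synthetic data (an assumption)."""
    rng = random.Random(seed)
    # Binary rewards and Gaussian "score" vectors are illustrative stand-ins.
    rewards = [float(rng.random() < 0.5) for _ in range(group_size)]
    scores = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(group_size)]
    direct = group_relative_gradient(rewards, scores)
    u_stat = u_statistic_gradient(rewards, scores)
    scale = (group_size - 1) / group_size
    return max(abs(d - scale * u) for d, u in zip(direct, u_stat))
```

Calling `max_abs_gap()` for any group size of at least 2 should return a value at floating-point rounding level, since the two forms are algebraically identical in this simplified setting.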
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
