Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic

Hongyi Zhou; Kai Ye; Erhan Xu; Jin Zhu; Ying Yang; Shijin Gong; Chengchun Shi

arXiv:2603.01162 · cs.LG · March 24, 2026

Open Access

TL;DR

This paper shows that the policy gradient of Group Relative Policy Optimization (GRPO) is a U-statistic, and uses that fact to characterize its mean squared error, the finite-sample error bound and asymptotic distribution of the learned policy's suboptimality gap, and the optimal group size, with supporting empirical validation.

Contribution

The paper offers a unified U-statistic framework for understanding GRPO, characterizes the statistical properties of its policy gradient and learned policy, and derives a universal scaling law for group size selection.

Findings

01 GRPO's policy gradient is a U-statistic.

02 GRPO is asymptotically equivalent to an oracle policy gradient algorithm that has access to a value function for the current policy.

03 The optimal group size follows a universal scaling law that holds across settings.

Abstract

Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE), derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm -- one with access to a value function that quantifies the goodness of its learning policy at each training iteration -- and…
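The U-statistic claim can be illustrated with a minimal numerical sketch. It is not the paper's construction. It assumes a simplified GRPO-style estimator that only subtracts the group-mean reward (no standard-deviation normalization, no clipping, no KL term) and uses scalar stand-ins for the score function. Under those assumptions, the mean-baseline gradient equals (G-1)/G times a second-order U-statistic with the symmetric kernel h = (1/2)(r_i - r_j)(s_i - s_j), the same identity that makes the sample covariance a U-statistic. All function and variable names below are hypothetical.

```python
import itertools
import random


def group_mean_baseline_gradient(rewards, scores):
    """Simplified GRPO-style estimate: average of (reward - group mean) * score.

    Assumption: mean-centering only; std normalization and clipping are omitted.
    """
    group_size = len(rewards)
    mean_reward = sum(rewards) / group_size
    return sum((r - mean_reward) * s for r, s in zip(rewards, scores)) / group_size


def pairwise_u_statistic(rewards, scores):
    """Second-order U-statistic: average of a symmetric kernel over all pairs i < j."""
    pairs = list(itertools.combinations(range(len(rewards)), 2))
    total = sum(
        0.5 * (rewards[i] - rewards[j]) * (scores[i] - scores[j]) for i, j in pairs
    )
    return total / len(pairs)


random.seed(0)
G = 8  # hypothetical group size (responses sampled per prompt)
rewards = [random.choice([0.0, 1.0]) for _ in range(G)]  # toy binary rewards
scores = [random.gauss(0.0, 1.0) for _ in range(G)]      # scalar stand-ins for score functions

lhs = group_mean_baseline_gradient(rewards, scores)
rhs = (G - 1) / G * pairwise_u_statistic(rewards, scores)
print(abs(lhs - rhs) < 1e-12)
```

The two quantities agree up to floating-point error for any group of size at least 2, which is why classical U-statistic tools (variance decompositions, asymptotic normality) become available for this kind of estimator. How the paper handles the normalization by the group standard deviation is not covered by this sketch.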

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics