Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients
Cheng Ge, Caitlyn Heqi Yin, Hao Liang, Jiawei Zhang

TL;DR
This paper explains why standard deviation normalization improves Group Relative Policy Optimization (GRPO): viewed through the local curvature of the sequence-level policy gradient, the normalization acts as an adaptive gradient. The argument is supported by a convergence analysis and by experiments on the GSM8K and MATH benchmarks.
Contribution
The paper gives a theoretical analysis of normalization in GRPO from a local-curvature perspective and empirically identifies three training phases governed by reward variance and feature orthogonality.
Findings
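Standard deviation normalization acts as an adaptive gradient in GRPO.
Under mild conditions, GRPO attains a strictly better convergence rate than unnormalized REINFORCE, with the gain characterized by the average within-prompt reward standard deviation.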
Empirical analysis reveals three training phases influenced by reward variance and feature orthogonality.
Abstract
Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and variance normalization. Yet why and when this normalization helps remains unclear. In this work, we provide an explanation through the lens of local curvature of the sequence-level policy gradient: standard deviation normalization implements an adaptive gradient. Theoretically, under mild conditions, GRPO enjoys a strictly improved convergence rate over unnormalized REINFORCE, with gains characterized by the average within-prompt reward standard deviation across prompts and iterations. Empirically, our analysis on GSM8K and MATH benchmarks reveals three distinct training phases governed by the interplay between feature orthogonality and reward variance:…
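To make the "per-prompt baselines and variance normalization" in the abstract concrete, here is a minimal sketch of group-relative advantages with and without the standard deviation term. It is an illustration, not the authors' code: the function name `group_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example reward lists are all assumptions. Dividing by the within-prompt standard deviation rescales every advantage for that prompt by the same factor, which is one way to read the "adaptive gradient" interpretation: low-variance prompts get a larger effective step, high-variance prompts a smaller one.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, normalize=True, eps=1e-8):
    """Group-relative advantages for the responses sampled for one prompt.

    rewards: scalar rewards, one per sampled response (hypothetical data).
    normalize: if True, divide by the within-prompt standard deviation.
    eps: small stabilizer (an assumption) so identical rewards do not divide by zero.
    """
    baseline = fmean(rewards)                    # per-prompt baseline (group mean)
    centered = [r - baseline for r in rewards]   # baseline-subtracted rewards
    if not normalize:
        return centered
    sigma = pstdev(rewards)                      # population std, assumed here
    return [c / (sigma + eps) for c in centered]


# Hypothetical binary rewards for two prompts with different reward variance.
groups = {
    "low_variance": [1, 1, 1, 1, 1, 1, 1, 0],
    "high_variance": [1, 0, 1, 0, 1, 0, 1, 0],
}

for name, rewards in groups.items():
    sigma = pstdev(rewards)
    # The normalized signal equals the unnormalized one times 1 / (sigma + eps),
    # i.e. a per-prompt multiplier on the step size.
    print(name, "sigma =", round(sigma, 4), "step multiplier =", round(1 / (sigma + 1e-8), 4))
```

When all responses to a prompt receive the same reward, the centered rewards are zero, so the prompt contributes no learning signal in either variant.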
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Domain Adaptation and Few-Shot Learning
