On the Hidden Objective Biases of Group-based Reinforcement Learning
Aleksandar Fontana, Marco Simoni, Giulio Rossolini, Andrea Saracino, Paolo Mori

TL;DR
This paper provides a theoretical analysis of group-based reinforcement learning methods like GRPO, revealing inherent biases and limitations that affect training dynamics and policy optimization.
Contribution
It introduces a unified surrogate formulation for analyzing GRPO-style methods, uncovering systematic biases and optimizer interactions that affect training.
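As a hedged illustration of how group weights can enter a surrogate of this kind, the sketch below computes the net coefficient that multiplies the gradient of a token shared by every response in a group. It is one reading of the setting, not the paper's derivation. It assumes mean-centered group advantages, an on-policy first step (importance ratio equal to 1), and a prefix token whose log-probability gradient is identical across responses. The rewards, lengths, and function names are hypothetical.

```python
def group_advantages(rewards):
    """Mean-centered advantages within one group (assumed form, no std scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def shared_prefix_coefficient(rewards, weights):
    """Net scalar multiplying the gradient of a token shared by all responses.

    Under the stated assumptions, each response i contributes
    weights[i] * advantage[i] times the same per-token gradient.
    """
    advantages = group_advantages(rewards)
    return sum(w * a for w, a in zip(weights, advantages))


# Hypothetical group of four responses to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [10, 40, 20, 80]
group_size = len(rewards)

# Uniform weighting: every response counts equally.
uniform_weights = [1.0 / group_size] * group_size
# Length-normalized weighting: response i is weighted by 1 / (G * |o_i|).
length_weights = [1.0 / (group_size * n) for n in lengths]

coef_uniform = shared_prefix_coefficient(rewards, uniform_weights)
coef_length = shared_prefix_coefficient(rewards, length_weights)

print(f"uniform weights:           {coef_uniform:.7f}")
print(f"length-normalized weights: {coef_length:.7f}")
```

With uniform weights the mean-centered advantages cancel, so the shared token receives no net push. With length-dependent weights the cancellation fails and a nonzero coefficient remains, even though the token is common to good and bad responses alike.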
Findings
Non-uniform group weighting causes gradient biases.
Interactions with AdamW reduce sensitivity to reward scaling.
Optimizer momentum can lead to policy updates beyond intended clipping.
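The reward-scaling finding can be made concrete with a minimal sketch of the AdamW update rule on a single scalar parameter. It assumes that multiplying the reward by a constant multiplies the gradient by the same constant, which holds only for variants that do not renormalize advantages. The gradient sequence, hyperparameters, and function name are hypothetical and are not taken from the paper.

```python
import math


def adamw_trajectory(grads, lr=1e-2, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.0, theta0=1.0):
    """Run scalar AdamW over a fixed gradient sequence and return the iterates."""
    theta, m, v = theta0, 0.0, 0.0
    iterates = []
    for t, g in enumerate(grads, start=1):
        theta -= lr * weight_decay * theta        # decoupled weight decay
        m = beta1 * m + (1 - beta1) * g           # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g       # second-moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
        iterates.append(theta)
    return iterates


# Hypothetical gradient sequence, and the same sequence with rewards scaled 100x.
base_grads = [0.5, -0.2, 0.8, 0.3, -0.1]
scaled_grads = [100.0 * g for g in base_grads]

traj_base = adamw_trajectory(base_grads)
traj_scaled = adamw_trajectory(scaled_grads)

# Largest difference between the two parameter trajectories.
max_gap = max(abs(a - b) for a, b in zip(traj_base, traj_scaled))
print(f"max trajectory gap under 100x scaling: {max_gap:.3e}")
```

The ratio of the first moment to the square root of the second moment is unchanged when every gradient is multiplied by the same constant, so the two trajectories differ only through the small `eps` term. Plain gradient descent on the same sequences would instead take steps 100 times larger.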
Abstract
Group-based reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), are now widely used to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of…
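Property (iii) can be illustrated with a toy model: a single log-probability parameter trained on a PPO-style clipped surrogate over repeated steps. This is a sketch under stated assumptions, not the paper's analysis. It uses heavy-ball momentum as a simple stand-in for the first-moment term in AdamW, a single token with positive advantage, and hypothetical values for the learning rate, momentum, and clip range.

```python
import math


def clipped_surrogate_grad(theta, theta_old, advantage, clip_eps):
    """Gradient w.r.t. theta of min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = exp(theta - theta_old) is the importance ratio."""
    ratio = math.exp(theta - theta_old)
    if advantage > 0 and ratio > 1 + clip_eps:
        return 0.0  # clipped: the objective is flat in theta here
    if advantage < 0 and ratio < 1 - clip_eps:
        return 0.0
    return advantage * ratio


def final_ratio(momentum, steps=20, lr=0.05, advantage=1.0, clip_eps=0.2):
    """Gradient ascent with heavy-ball momentum; returns the final ratio."""
    theta_old = 0.0
    theta, velocity = theta_old, 0.0
    for _ in range(steps):
        grad = clipped_surrogate_grad(theta, theta_old, advantage, clip_eps)
        velocity = momentum * velocity + grad   # momentum keeps past gradients
        theta += lr * velocity
    return math.exp(theta - theta_old)


clip_eps = 0.2
ratio_plain = final_ratio(momentum=0.0)
ratio_momentum = final_ratio(momentum=0.9)

print(f"clip boundary:      {1 + clip_eps:.3f}")
print(f"no momentum:        {ratio_plain:.3f}")
print(f"momentum beta=0.9:  {ratio_momentum:.3f}")
```

Without momentum, the parameter stops moving on the first step after the ratio crosses the clip boundary, because the gradient there is zero. With momentum, the accumulated velocity keeps moving the parameter even though every new gradient is zero, so the ratio ends well outside the clipping region.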
Taxonomy
Topics: Reinforcement Learning in Robotics · Natural Language Processing Techniques · Topic Modeling
