Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments

Bryan L. M. de Oliveira; Felipe V. Frujeri; Marcos P. C. M. Queiroz; Luana G. B. Martins; Telma W. de L. Soares; Luckeciano C. Melo

arXiv:2511.03527 · cs.LG · November 6, 2025

TL;DR

This paper systematically evaluates Group Relative Policy Optimization (GRPO) in classical reinforcement learning, revealing that critics are crucial for long-horizon tasks and highlighting conditions where critic-free methods are viable.

Contribution

First comprehensive study of GRPO in classical RL environments, identifying when critic-free approaches succeed or fail compared to traditional methods.

Findings

01

Critics are essential for long-horizon tasks like HalfCheetah.

02

GRPO benefits from high discount factors (gamma = 0.99), though not in every environment.

03

Smaller group sizes outperform larger ones in batch grouping strategies.

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a scalable alternative to Proximal Policy Optimization (PPO) by eliminating the learned critic and instead estimating advantages through group-relative comparisons of trajectories. This simplification raises fundamental questions about the necessity of learned baselines in policy-gradient methods. We present the first systematic study of GRPO in classical single-task reinforcement learning environments, spanning discrete and continuous control tasks. Through controlled ablations isolating baselines, discounting, and group sampling, we reveal three key findings: (1) learned critics remain essential for long-horizon tasks: all critic-free baselines underperform PPO except in short-horizon environments like CartPole where episodic returns can be effective; (2) GRPO benefits from high discount factors (gamma = 0.99) except in…
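To make "estimating advantages through group-relative comparisons of trajectories" concrete, here is a minimal sketch of how a critic-free advantage can be computed from a group of sampled trajectories. It assumes the commonly used GRPO form, in which each trajectory's return is standardized against the mean and standard deviation of its group; the function names, the `eps` term, and the example rewards are illustrative assumptions and are not taken from the paper, whose exact variant may differ.

```python
from statistics import mean, pstdev


def discounted_return(rewards, gamma=0.99):
    """Discounted sum of one trajectory's rewards (gamma is the discount factor)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def group_relative_advantages(returns, eps=1e-8):
    """Standardize each return against its group; no learned critic is involved.

    Assumption: mean/std normalization over the group. `eps` is a hypothetical
    stabilizer that avoids division by zero when all returns are equal.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(g - mu) / (sigma + eps) for g in returns]


# Hypothetical group of three trajectories sampled from the same policy.
group_rewards = [[1.0, 1.0, 1.0], [1.0, 0.0], [0.0]]
returns = [discounted_return(rs, gamma=0.99) for rs in group_rewards]
advantages = group_relative_advantages(returns)
```

Every step of a trajectory shares that trajectory's single advantage value here, which is one way to see why a learned critic, which provides per-state baselines, could matter more as horizons grow.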

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research