How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
Minghao Tian, Yunfei Xie, Chen Wei

TL;DR
Mu-GRPO is a new reinforcement learning framework for large language models that tolerates higher rollout staleness, reducing training overhead while maintaining or improving performance.
Contribution
The paper introduces Mu-GRPO, a training method that tolerates larger rollout staleness and reduces overhead. Two stabilization techniques keep learning stable under stale data: relaxed clipping and negative-advantage veto.
Findings
Mu-GRPO matches or exceeds the performance of standard GRPO.
It achieves roughly a 2x speedup in training time.
It is effective across multiple language models and benchmarks.
Abstract
Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger…
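To make the clipping discussion concrete, here is a minimal sketch of the standard GRPO-style ingredients the abstract builds on: group-normalized advantages and a PPO-style clipped surrogate. It is not the paper's implementation. The function names, the `clip_low`/`clip_high` parameters, and the wider range used to stand in for "relaxed clipping" are illustrative assumptions, since this excerpt does not specify how Mu-GRPO relaxes the clip. The negative-advantage veto is not sketched because its description is truncated above.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_low=0.2, clip_high=0.2):
    """PPO-style clipped objective for a single token.

    logp_old comes from the (possibly stale) policy that generated the rollout.
    clip_low / clip_high are hypothetical knobs, not values from the paper.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


# One group of four rollouts for the same prompt, with 0/1 verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A stale rollout: the current policy has drifted, so the ratio is 1.5.
logp_old, logp_new = 0.0, math.log(1.5)
tight = clipped_surrogate(logp_new, logp_old, advantage=1.0)
wide = clipped_surrogate(logp_new, logp_old, advantage=1.0, clip_high=0.6)
```

With the tight range the objective saturates at the clip boundary, so this token contributes no gradient. With the wider range the unclipped term is still active. That is one plausible reading of how a relaxed clip could preserve gradients from stale rollouts, stated here as an assumption rather than the authors' mechanism.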
