How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

Minghao Tian; Yunfei Xie; Chen Wei

arXiv:2605.17570·cs.LG·May 19, 2026

TL;DR

Mu-GRPO is a new reinforcement learning framework for large language models that tolerates higher rollout staleness, reducing training overhead while maintaining or improving performance.

Contribution

The paper introduces Mu-GRPO, a training framework that tolerates larger rollout staleness and reduces rollout-optimization switching overhead. Two techniques stabilize learning on stale data: relaxed clipping and a negative-advantage veto.

Findings

01. Mu-GRPO matches or exceeds standard GRPO performance.

02. Mu-GRPO achieves roughly a 2x speedup in training time.

03. Mu-GRPO is effective across multiple language models and benchmarks.

Abstract

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger…
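
The abstract does not give the exact form of relaxed clipping, so the following is only a rough sketch under stated assumptions. It assumes the standard PPO-style clipped surrogate that GRPO-type methods commonly use, and models "relaxed" clipping simply as a wider clip range. The function names and the clip values (0.2 and 0.8) are hypothetical and are not taken from the paper. The negative-advantage veto is not sketched, because its description is truncated above.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_low, clip_high):
    """Clipped surrogate for one token: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - clip_low, min(ratio, 1.0 + clip_high))
    return min(ratio * advantage, clipped_ratio * advantage)

def has_gradient(ratio, advantage, clip_low, clip_high):
    """True if the surrogate still depends on the ratio (not clipped away)."""
    if advantage > 0:
        return ratio <= 1.0 + clip_high
    if advantage < 0:
        return ratio >= 1.0 - clip_low
    return False

# Four rollouts for one prompt with verifiable 0/1 rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A stale rollout: the current policy has drifted, so the importance
# ratio pi_new / pi_old is far from 1 (hypothetical value).
stale_ratio = 1.5

# Hypothetical clip ranges: a tight one and a relaxed (wider) one.
tight = has_gradient(stale_ratio, advantages[0], clip_low=0.2, clip_high=0.2)
relaxed = has_gradient(stale_ratio, advantages[0], clip_low=0.2, clip_high=0.8)
print(tight, relaxed)
```

Under these assumptions, the tight range clips the positive-advantage stale sample and it contributes no gradient, while the wider range keeps it in play. This is one way to read the abstract's claim that relaxed clipping "preserves useful stale-rollout gradients".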
