Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

Youssef Mroueh; Nicolas Dupuis; Brian Belgodere; Apoorva Nitsure; Mattia Rigotti; Kristjan Greenewald; Jiri Navratil; Jerret Ross; Jesus Rios

arXiv:2505.22257·cs.LG·June 2, 2025

Open Access · 3 Reviews

TL;DR

This paper revisits Group Relative Policy Optimization (GRPO) in on-policy and off-policy settings, demonstrating that off-policy GRPO can outperform or match on-policy performance, with implications for improved reinforcement learning stability and efficiency.

Contribution

The paper adapts GRPO to the off-policy setting, showing its effectiveness and comparing it empirically to on-policy GRPO, highlighting advantages in training stability and reward improvement.

Findings

01. Off-policy GRPO either significantly outperforms on-policy GRPO or performs on par with it.

02. Clipped surrogate objectives enhance off-policy GRPO training.

03. Empirical results confirm the benefits of off-policy GRPO in reinforcement learning.

Abstract

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with…
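The two ingredients the abstract names, a group-relative advantage and a clipped surrogate objective, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sequence-level log-probabilities, the `eps` stabilizer, and the `clip_eps=0.2` default are all hypothetical, and the KL regularization term is omitted. In the off-policy reading, `logp_old` would come from an older policy that generated the samples.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: eps is a small stabilizer added to the standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over the group (to be maximized).

    logp_new: log-probabilities of each response under the policy being trained.
    logp_old: log-probabilities under the policy that generated the samples
              (possibly a stale one in the off-policy setting).
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Usage: four responses with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
logp_old = [-3.0, -2.5, -4.0, -3.5]
objective = clipped_surrogate(logp_old, logp_old, advantages)  # ratio is 1 everywhere
```

When every reward in a group is identical, the standardized advantages are all zero and the group contributes no learning signal, which is the zero-variance case the reviews mention in connection with masking.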

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- An explicit derivation of a reward-improvement lower bound for GRPO (on/off-policy) and a principled route from KL-constraint to a clipped surrogate objective.
- The $(i,v)$ design is simple and addresses a real bottleneck (communication/model updates) in multi-GPU/TP training.
- The analysis explains why zero-variance reward samples degrade constants and how masking can stabilize training in verifiable-reward tasks.
- Results on reasoning benchmarks show that the off-policy variant is c

Weaknesses

- Most comparisons are within the GRPO family (on vs off; with/without masking). The paper should include controlled head-to-head comparisons with **DAPO / DR-GRPO / OPPO / ToPPO / DPO / IPO** under matched rewarders, sampling budgets, and compute. The 7B setting reports system speed but lacks full accuracy curves and significance tests. Sensitivity to temperature, context length, $\beta$, group size $G$, and $(i,v)$ needs more systematic coverage.
- Evidence is concentrated on verifiable

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. **Theoretical Rigor and Novelty**
   * Provides the **first formal policy improvement bound for GRPO**, including both on- and off-policy cases.
   * The proofs elegantly extend beyond standard MDP-based analyses, avoiding dependence on state visitation distributions and instead exploiting GRPO’s analytical advantage form.
   * The derivation of **clipped surrogate objectives** for off-policy GRPO generalizes the PPO framework in a principled way.
2. **Practical Impact for LLM Training**

Weaknesses

1. **Limited Experimental Diversity**
   * Experiments focus primarily on math datasets (GSM8K, DeepScaleR). Broader testing on reasoning, code generation, or instruction-following datasets would strengthen generality.
   * The number of random seeds (three) is relatively small, and there’s little reporting of variance beyond Pass@1 standard deviation.
2. **Theory–Practice Gap**
   * While the proofs are strong, some constants in the policy improvement bound (e.g., the variance-dependent term

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper analyzes GRPO, a recent, simple, and efficient algorithm for training LLMs. The authors, inspired by recent off-policy variants of PPO, propose an Off-Policy variant of GRPO. They also propose a policy improvement bound that covers both off-policy and on-policy development.

Weaknesses

First, let me write here that I am not familiar with the LLM literature. However, I am very familiar with the reinforcement learning literature and understand GRPO and PPO well. I scored my own assessment of the paper with _confidence=3_ for this reason.

Clarity
--------

The paper's presentation, in my opinion, is quite confusing. Let me state why:

* As far as I understand, the central motivation of the paper revolves around the development of a decentralized update scheme of GRPO. This point

Taxonomy

Topics: Human Resource Development and Performance Evaluation