Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Xixiang He; Qiyao Sun; Ao Cheng; Xingming Li; Xuanyu Ji; Hailun Lu; Runke Huang; Qingyong Hu

arXiv:2605.21125 · cs.LG · May 21, 2026

TL;DR

This paper identifies advantage collapse as a failure mode of Group Relative Policy Optimization (GRPO), introduces the Advantage Collapse Rate (ACR) as a diagnostic metric, and proposes AVSPO, an extension of GRPO that mitigates the collapse and improves accuracy across model scales.

Contribution

The paper introduces ACR, a diagnostic metric for advantage collapse, and proposes AVSPO, a lightweight extension of GRPO that mitigates it.

Findings

1. ACR strongly predicts training stagnation and final performance.
2. AVSPO reduces advantage collapse by 58-63%.
3. AVSPO improves accuracy by 4-6 percentage points across model scales.

Abstract

Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward…
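The failure mode described in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes the commonly used GRPO advantage, where each reward is standardized by its group's mean and standard deviation. It also assumes a simplified collapse criterion: a group counts as collapsed when its reward standard deviation falls below a tolerance. The abstract defines ACR over training batches, so the per-group rate below is only an approximation for illustration. The function names, `eps`, and `tol` are hypothetical.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    `rewards` has shape (num_groups, group_size). This standardized form is
    an assumption of the sketch, not taken from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def advantage_collapse_rate(rewards, tol=1e-6):
    """Fraction of groups whose rewards are (near-)homogeneous.

    Assumed criterion: a group is collapsed when its reward std is below `tol`,
    so every advantage in the group is approximately zero.
    """
    rewards = np.asarray(rewards, dtype=float)
    collapsed = rewards.std(axis=1) < tol
    return float(collapsed.mean())

# Four prompts, four sampled answers each, with binary verifiable rewards.
rewards = np.array([
    [1, 1, 1, 1],  # all correct   -> zero advantages
    [0, 0, 0, 0],  # all incorrect -> zero advantages
    [1, 0, 0, 1],  # mixed         -> non-zero advantages
    [0, 0, 1, 0],  # mixed         -> non-zero advantages
])

adv = group_advantages(rewards)
acr = advantage_collapse_rate(rewards)
print(adv)
print(f"collapse rate: {acr:.2f}")
```

In this toy batch the first two groups contribute no learning signal, because every advantage in them is zero, so half of the groups are collapsed.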

Code & Models

Repositories

https://qingyonghu.github.io/AVSPO