TL;DR
This paper identifies advantage collapse, a failure mode of Group Relative Policy Optimization (GRPO), introduces the Advantage Collapse Rate (ACR) as a diagnostic metric, and proposes AVSPO, a lightweight GRPO extension that reduces collapse and improves the accuracy of large language models on mathematical reasoning benchmarks.
Contribution
The paper introduces ACR as a diagnostic tool for advantage collapse and proposes AVSPO, a lightweight extension that mitigates advantage collapse in GRPO.
Findings
ACR strongly predicts training stagnation and final performance.
AVSPO reduces advantage collapse by 58-63%.
AVSPO improves accuracy by 4-6 percentage points across model scales.
Abstract
Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward…
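The collapse mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO group normalization, where each advantage is the reward minus the group mean, divided by the group standard deviation plus a small constant. The ACR computation here is only a guess at the metric's shape (the fraction of groups whose rewards are all identical); the paper's exact definition, threshold, and batching may differ. The function names `group_advantages`, `is_collapsed`, and `advantage_collapse_rate`, and the parameters `eps` and `tol`, are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumes the standard GRPO normalization; eps is an illustrative value.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_collapsed(rewards, tol=1e-8):
    """A group is treated as collapsed when its rewards are (near) identical.

    The tolerance is an assumption, not a value from the paper.
    """
    return pstdev(rewards) <= tol


def advantage_collapse_rate(groups, tol=1e-8):
    """Fraction of groups with homogeneous rewards (assumed form of ACR)."""
    if not groups:
        return 0.0
    return sum(is_collapsed(g, tol) for g in groups) / len(groups)


# Four hypothetical groups of binary verifiable rewards (1 = correct answer).
groups = [
    [1, 1, 1, 1],  # all correct: every advantage is zero
    [0, 0, 0, 0],  # all incorrect: every advantage is zero
    [1, 0, 0, 1],  # mixed: nonzero advantages, useful gradient
    [0, 1, 0, 0],  # mixed: nonzero advantages, useful gradient
]

for g in groups:
    print(g, [round(a, 3) for a in group_advantages(g)])

print("ACR:", advantage_collapse_rate(groups))
```

In the two homogeneous groups every reward equals the group mean, so the numerator is zero for each sample and the policy gradient contribution vanishes, which is the failure mode the abstract describes.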
