M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Bizhe Bai; Hongming Wu; Peng Ye; Tao Chen

arXiv:2512.13070·cs.AI·December 16, 2025

TL;DR

This paper introduces M-GRPO, a momentum-based reinforcement learning framework with an adaptive filtering technique to stabilize training and improve reasoning capabilities of large language models without human annotations.

Contribution

The paper proposes M-GRPO with a momentum model for stable training and an IQR-based filter to maintain policy diversity, addressing collapse issues in self-supervised RL for LLMs.
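To make the IQR-based filter concrete, here is a minimal sketch of one way such a filter could work. This summary does not say which statistic the filter is computed on or which threshold is used, so everything below is an assumption for illustration: the function name `iqr_low_entropy_filter`, the choice of per-trajectory entropy as the filtered quantity, the conventional `k = 1.5` fence multiplier, and the decision to drop only low outliers.

```python
from statistics import quantiles


def iqr_low_entropy_filter(entropies, k=1.5):
    """Return indices of rollouts whose entropy is NOT a low outlier.

    Hypothetical helper, not the paper's implementation. A rollout is
    dropped when its entropy falls below Q1 - k * IQR, the standard
    lower "fence" used in box plots. Needs at least two values.
    """
    # Quartiles of the entropies in the current batch of rollouts.
    q1, _, q3 = quantiles(entropies, n=4, method="inclusive")
    lower_fence = q1 - k * (q3 - q1)
    # Keep every rollout at or above the fence.
    return [i for i, h in enumerate(entropies) if h >= lower_fence]


# Example: the last rollout is far less diverse than the rest.
batch_entropies = [1.0, 1.1, 0.9, 1.05, 0.95, 0.1]
kept = iqr_low_entropy_filter(batch_entropies)
```

Because the fence is recomputed from each batch, the cutoff adapts as the entropy distribution shifts during training, which is one plausible reading of "adaptive" here.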

Findings

01. M-GRPO stabilizes self-supervised RL training of large language models.
02. The IQR filter prevents premature policy collapse.
03. M-GRPO achieves state-of-the-art results on reasoning benchmarks.

Abstract

Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts -- a common strategy to improve performance -- only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely…
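A "slowly evolving momentum model" is commonly realized as an exponential moving average (EMA) of the policy weights. The abstract excerpt does not give the update rule, so the sketch below is an assumption rather than the authors' method: the function name `momentum_update`, the coefficient `m = 0.99`, and the use of plain floats in place of weight tensors are all hypothetical.

```python
def momentum_update(momentum_params, policy_params, m=0.99):
    """One EMA step: momentum <- m * momentum + (1 - m) * policy.

    Hypothetical sketch. Parameters are dicts mapping names to floats;
    a real model would apply the same rule to each weight tensor.
    """
    return {
        name: m * momentum_params[name] + (1.0 - m) * policy_params[name]
        for name in momentum_params
    }


# Example: the policy jumps from 1.0 to 0.0, the momentum copy lags behind.
momentum = {"w": 1.0}
policy = {"w": 0.0}
for _ in range(3):
    momentum = momentum_update(momentum, policy, m=0.9)
```

With `m` close to 1, the momentum copy changes only slightly per step, so targets derived from it drift far more slowly than the policy being trained.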

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning