MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning

Youngeun Kim

arXiv:2601.22582 · cs.LG · February 2, 2026

TL;DR

MC-GRPO replaces the mean reward baseline of group-relative policy optimization with a median baseline for small-rollout reinforcement learning, improving stability and accuracy by reducing the influence of outlier rewards on advantage signs.

Contribution

The paper proposes replacing the mean reward baseline with a median baseline in group-relative policy optimization to enhance stability in resource-constrained settings.

Findings

01. A median baseline reduces sign flips in advantage estimation.

02. MC-GRPO improves accuracy in low-rollout regimes.

03. The performance gap between small and larger rollout budgets is reduced.

Abstract

Group-relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource-constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign, and the update direction is reversed. To address this, we propose Median-Centered Group Relative Policy Optimization (MC-GRPO), a simple and effective solution for small-rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under small rollout size (G). We generate one additional rollout for median reference (G+1), and compute advantages by using the group median. With an…
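The difference between the two baselines can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the reward values, and the choice to center without any scale normalization are assumptions made for illustration, and disagreement between the two baselines is used here only as a stand-in for a sign flip. With an odd group size (G + 1), the median is one of the sampled rewards, so that rollout receives an advantage of exactly zero.

```python
import statistics


def mean_centered_advantages(rewards):
    """Center rewards on the group mean (the shared baseline the abstract describes)."""
    baseline = statistics.fmean(rewards)
    return [r - baseline for r in rewards]


def median_centered_advantages(rewards):
    """Center rewards on the group median (the replacement the abstract proposes).

    Assumption: centering only; any scale normalization is omitted here.
    """
    baseline = statistics.median(rewards)
    return [r - baseline for r in rewards]


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Hypothetical rewards for G + 1 = 5 rollouts of one prompt; the last is an outlier.
rewards = [0.1, 0.2, 0.3, 0.4, 5.0]

mean_adv = mean_centered_advantages(rewards)
median_adv = median_centered_advantages(rewards)

# The outlier drags the mean to 1.2, so four rollouts get a negative advantage.
print([sign(a) for a in mean_adv])    # [-1, -1, -1, -1, 1]
# The median stays at 0.3, so the rollout with reward 0.4 keeps a positive sign.
print([sign(a) for a in median_adv])  # [-1, -1, 0, 1, 1]
```

In this example the rollout with reward 0.4 is better than most of its group, yet the mean baseline assigns it a negative advantage because a single outlier shifted the baseline. The median baseline leaves its sign positive.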

Code & Models

Models

JayLuci4/chronos-poc (Hugging Face model)

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning