MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

Kangda Wei; Ruihong Huang

arXiv:2601.09085·cs.LG·January 15, 2026

TL;DR

This paper introduces MMR-GRPO, a method that uses diversity-aware reward reweighting to accelerate GRPO training, reducing training steps and time without sacrificing performance.

Contribution

The paper proposes MMR-GRPO, which integrates Maximal Marginal Relevance (MMR) into GRPO-style training of mathematical reasoning models. Rewards are reweighted so that diverse completions are emphasized, which improves training efficiency.

Findings

01

Requires 47.9% fewer training steps on average.

02

Reduces wall-clock training time by 70.2%.

03

Maintains comparable peak performance across models and benchmarks.

Abstract

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable…
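
The reweighting idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`mmr_reweight`, `group_advantages`), the trade-off parameter `lam`, the use of cosine similarity over precomputed completion embeddings, and the specific down-weighting rule are all assumptions made for illustration. The sketch runs a classic greedy MMR pass over one prompt's group of completions, shrinks the reward of each completion in proportion to its similarity to completions already selected, and then applies the usual group-relative normalization (mean and standard deviation within the group).

```python
import numpy as np


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass over one group of completions (illustrative only).

    rewards:    shape (G,), one scalar reward per completion.
    embeddings: shape (G, D), one embedding per completion (assumed given).
    lam:        hypothetical relevance/diversity trade-off in [0, 1].
    """
    r = np.asarray(rewards, dtype=float)
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    # Cosine similarity, clipped so that redundancy stays in [0, 1].
    sim = np.clip(e @ e.T, 0.0, 1.0)

    n = len(r)
    redundancy = np.zeros(n)  # max similarity to any already-selected item
    new_r = np.zeros(n)
    remaining = list(range(n))
    while remaining:
        # Classic MMR score: relevance minus similarity to the selected set.
        scores = [lam * r[i] - (1.0 - lam) * redundancy[i] for i in remaining]
        best = remaining[int(np.argmax(scores))]
        # Assumed weighting rule: shrink the reward by the item's redundancy.
        new_r[best] = r[best] * (1.0 - (1.0 - lam) * redundancy[best])
        remaining.remove(best)
        for i in remaining:
            redundancy[i] = max(redundancy[i], sim[i, best])
    return new_r


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Toy group: completions 0 and 1 are duplicates, 2 is a distinct correct
# solution, 3 is incorrect.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reweighted = mmr_reweight(rewards, embeddings, lam=0.5)
advantages = group_advantages(reweighted)
```

In this toy group, the duplicate completion keeps only half of its reward while the two distinct correct completions keep theirs, so after normalization the policy update favors the diverse solutions. How the paper actually computes similarity and maps MMR scores to reward weights is not stated in the text above.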

Taxonomy

Topics: Reinforcement Learning in Robotics · Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications