MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
Kangda Wei, Ruihong Huang

TL;DR
This paper introduces MMR-GRPO, a method that uses diversity-aware reward reweighting to accelerate GRPO training, reducing training steps and time without sacrificing performance.
Contribution
The paper proposes MMR-GRPO, which integrates Maximal Marginal Relevance into GRPO-style training of mathematical reasoning models, improving training efficiency by giving more weight to diverse completions.
Findings
Requires 47.9% fewer training steps on average.
Reduces wall-clock training time by 70.2%.
Maintains comparable peak performance across models and benchmarks.
Abstract
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable…
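The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mmr_reweight` and `group_advantages`, the trade-off parameter `lam`, the use of cosine similarity over completion embeddings, and the choice to use the MMR score itself as the adjusted reward are all assumptions made for illustration. The sketch runs a greedy Maximal Marginal Relevance pass over one prompt's group of completions, penalizing each completion by its similarity to the closest completion already selected, and then applies the usual group-relative normalization.

```python
import numpy as np


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass over one group of completions (illustrative only).

    Assumption: the adjusted reward is the MMR score at selection time,
    lam * reward - (1 - lam) * max cosine similarity to selected items.
    Embeddings are assumed to be non-zero vectors.
    """
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T  # pairwise cosine similarity

    n = len(rewards)
    max_sim = np.zeros(n)  # similarity to the closest already-selected item
    adjusted = np.zeros(n)
    remaining = list(range(n))

    while remaining:
        scores = lam * rewards[remaining] - (1.0 - lam) * max_sim[remaining]
        best = int(np.argmax(scores))
        pick = remaining[best]
        adjusted[pick] = scores[best]
        remaining.remove(pick)
        # Update the redundancy penalty for everything not yet selected.
        for j in remaining:
            max_sim[j] = max(max_sim[j], sim[pick, j])
    return adjusted


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    # Three equally rewarded completions; the first two are duplicates.
    rewards = [1.0, 1.0, 1.0]
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    adjusted = mmr_reweight(rewards, embeddings, lam=0.5)
    print(adjusted, group_advantages(adjusted))
```

In this toy group the duplicate completion ends up with a lower adjusted reward than the two distinct ones, so after normalization it receives a negative advantage even though all three raw rewards were equal. With unmodified rewards the group would carry no learning signal at all, since every advantage would be zero.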
Taxonomy
Topics: Reinforcement Learning in Robotics · Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications
