DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

Xiwen Chen; Wenhui Zhu; Peijie Qiu; Xuanzhao Dong; Hao Wang; Haiyu Wu; Huayu Li; Aristeidis Sotiras; Yalin Wang; Abolfazl Razi

arXiv:2505.09655 · cs.CL · March 3, 2026

TL;DR

This paper introduces DRA-GRPO, a method that calibrates GRPO rewards according to the semantic diversity of the sampled reasoning paths, so that distinct but equally correct solutions are not crowded out. The calibration improves LLM mathematical reasoning on five math benchmarks.

Contribution

DRA-GRPO is a new framework that adjusts rewards based on semantic diversity, effectively encouraging varied reasoning strategies in LLM training.

Findings

01 Outperforms strong baselines on five math benchmarks.

02 Achieves 58.2% average accuracy with only 7,000 fine-tuning samples.

03 Demonstrates the importance of diversity calibration in data-efficient training.

Abstract

Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive…
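The reward calibration described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything specific here is an assumption: the semantic density of a completion is approximated by the sum of its clipped cosine similarities to the other completions in the group (a graph-cut-style stand-in for the SMI term), the adjustment divides each reward by one plus that density (an inverse-propensity-style weight), and the function names and toy embeddings are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def diversity_adjusted_rewards(rewards, embeddings):
    """Down-weight rewards of completions that are semantically redundant.

    Assumption: density_i = sum over j != i of max(sim(i, j), 0), and the
    adjusted reward is r_i / (1 + density_i). This is an illustrative
    stand-in for the paper's SMI-based term, not its exact definition.
    """
    rewards = np.asarray(rewards, dtype=float)
    sim = cosine_similarity_matrix(embeddings)
    np.fill_diagonal(sim, 0.0)                    # ignore self-similarity
    density = np.clip(sim, 0.0, None).sum(axis=1)  # non-negative density
    return rewards / (1.0 + density)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    # Toy group of four completions: the first two follow the same
    # reasoning path, the third is correct via a different path,
    # and the fourth is incorrect.
    rewards = [1.0, 1.0, 1.0, 0.0]
    embeddings = [
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
    adjusted = diversity_adjusted_rewards(rewards, embeddings)
    # Adjusted rewards here are 0.5, 0.5, 1.0, 0.0.
    print(adjusted)
    print(group_relative_advantages(adjusted))
```

With plain correctness rewards, all three correct completions would receive the same advantage. After the adjustment, the two redundant completions share their credit while the structurally distinct one keeps its full reward, which is the qualitative effect the abstract attributes to DRA.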

Code & Models

Repositories

xiwenc1/dra-grpo (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Reinforcement Learning in Robotics

Methods: Dynamic Range Activator