Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Yanqi Dai; Yuxiang Ji; Xiao Zhang; Yong Wang; Xiangxiang Chu; Zhiwu Lu

arXiv:2601.20614·cs.AI·January 29, 2026

TL;DR

MathForge enhances mathematical reasoning in large models by systematically focusing on harder questions through difficulty-aware optimization and multi-aspect question reformulation, leading to significant performance improvements.

Contribution

Introduces a dual framework combining difficulty-aware policy optimization and multi-aspect question reformulation to better target challenging questions in mathematical reasoning tasks.

Findings

1. Outperforms existing methods on multiple mathematical reasoning benchmarks.
2. Effectively balances learning from easy and hard questions.
3. Demonstrates the benefit of difficulty-aware data augmentation in reasoning models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and…
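The imbalance the abstract attributes to GRPO can be illustrated with a minimal sketch. It assumes the commonly used GRPO advantage, (reward − group mean) / group standard deviation, together with binary verifiable rewards and a population standard deviation; the function names `grpo_advantages` and `total_abs_advantage` are hypothetical, and the sketch is not the paper's own analysis or its DGPO algorithm. Under these assumptions, the summed absolute advantage for a group of size G with accuracy p works out to 2·G·sqrt(p(1 − p)), which peaks at p = 0.5 and shrinks as a question gets harder.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation over the group;
    real implementations may differ (e.g. use the sample std).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def total_abs_advantage(num_correct, group_size):
    """Sum of |advantage| over one group with binary (0/1) rewards.

    Used here as a rough proxy for how strongly a single question
    drives the policy update.
    """
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return sum(abs(a) for a in grpo_advantages(rewards))


if __name__ == "__main__":
    G = 8  # hypothetical group size (responses sampled per question)
    for k in range(G + 1):
        p = k / G
        closed_form = 2 * G * math.sqrt(p * (1 - p))
        print(f"accuracy={p:.3f}  sum|A|={total_abs_advantage(k, G):.3f}  "
              f"2G*sqrt(p(1-p))={closed_form:.3f}")
```

In this sketch a question answered correctly once in eight samples receives a smaller total advantage magnitude than one answered correctly four times in eight, and a question with no correct samples contributes nothing at all.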

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper makes several notable contributions. First, it provides rigorous mathematical analysis with formal proofs (Theorems 1 and 2) demonstrating that GRPO's advantage estimation creates an inherent bias favoring moderate-difficulty questions over harder ones, which is a valuable theoretical insight into a widely-used algorithm. Second, the proposed solutions are well-designed and complementary: DGPO corrects the algorithmic bias while MQR enriches training data with harder questions, creatin

Weaknesses

Despite its strengths, the paper has several limitations that warrant consideration. The central premise that prioritizing harder questions is universally beneficial lacks deeper justification—while GRPO under-weights them, the paper doesn't definitively establish when or why this is detrimental, or whether all types of difficulty contribute equally to learning. The MQR approach's reliance on OpenAI o3, an exceptionally powerful and potentially expensive model, raises concerns about reproducibil

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. The idea of DQW is intuitive and reasonable. Compared to the difficulty-based advantage reweighting technique used in arxiv:2504.09696, DQW is simpler and has fewer hyperparameters to tune. 2. The evaluation is comprehensive in that it covers various benchmarks, including AMC, AIME, MATH500 and the Olympiad.

Weaknesses

1. None of the three techniques introduced in the paper is completely novel. The biased estimation issue is well known in the community and has been handled in the Dr.GRPO work. Difficulty-aware policy optimization is also a topic already touched on in arxiv:2504.09696. The Multi-Aspect Question Reformulation (MQR) strategy belongs to the family of LLM-based question augmentation methods that have been investigated in depth since arxiv:2210.11610. 2. The theoretical analysis of DGAE is not convincing to me

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper proves that traditional GRPO has an implicit bias, which can be resolved by DGPO with a small modification. 2. This paper proposes a data synthesis method, MQR, which shows performance improvement.

Weaknesses

The paper lacks an in-depth comparison of GRPO and DGPO under different settings, e.g., queries of different difficulty levels and different types of datasets. More is discussed in the Questions section.

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling