Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
Suraj Yadav, Siddharth Yadav, Parth Goyal

TL;DR
This paper investigates the limits of difficulty scaling in small language models (SLMs, up to 3B parameters), showing that hard samples yield diminishing accuracy gains under GRPO with LoRA, especially beyond a capacity boundary.
Contribution
It demonstrates that training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, and it shows that the gains depend on both the model prior and the dataset difficulty.
Findings
Accuracy plateaus as problem difficulty increases, which points to a capacity boundary.
Training only on lower-difficulty problems matches full-dataset accuracy while using about 45% of the training steps.
Cross-dataset generalization: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO.
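A difficulty-stratified analysis of the kind the findings refer to can be sketched as a per-tier accuracy table. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy_by_tier`, the integer tier labels, and the toy records are all assumptions made for the example.

```python
from collections import defaultdict

def accuracy_by_tier(records):
    """Return {tier: accuracy} for an iterable of (tier, is_correct) pairs.

    `tier` is a hypothetical difficulty label (for example 1 = easiest).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in records:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    # Sort tiers so the output reads from easiest to hardest.
    return {tier: correct[tier] / totals[tier] for tier in sorted(totals)}

# Hypothetical toy results, not data from the paper.
toy = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(accuracy_by_tier(toy))  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Comparing such tables for a model trained on the full dataset and one trained only on the lower tiers is one way to check whether the harder samples changed accuracy in any tier.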
Abstract
Recent alignment work on Large Language Models (LLMs) suggests that preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to small language models (SLMs, up to 3B parameters) for math reasoning on the GSM8K and MATH datasets, with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving accuracy on the hardest tier. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO,…
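For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage used in its standard formulation: each sampled completion's reward is normalized by the mean and standard deviation of the rewards in its group. This is an illustration under stated assumptions, not the authors' training code. The function name, the `eps` constant, the use of the population standard deviation, and the binary correctness rewards are assumptions. The link to hard samples is offered only as a possible reading, not as a mechanism the abstract states: when every completion in a group receives the same reward, all advantages are zero and that prompt contributes no policy-gradient signal.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption of the sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary rewards (1 = correct final answer, 0 = incorrect).
mixed_group = [1.0, 0.0, 0.0, 0.0]   # one of four samples is correct
failed_group = [0.0, 0.0, 0.0, 0.0]  # no sample is correct

print(group_relative_advantages(mixed_group))
print(group_relative_advantages(failed_group))
```

In the mixed group the single correct completion gets a positive advantage and the others a negative one; in the all-failed group every advantage is zero.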
