Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning
Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu

TL;DR
This paper introduces a dynamic, optimization-based training framework for Large Language Models that adaptively allocates training resources to improve reasoning performance, especially on hard problems, outperforming traditional uniform sampling methods.
Contribution
It proposes Multi-Adversary GDRO with online difficulty classification and two novel GDRO games, providing a more efficient training approach for LLM reasoning tasks.
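To make the idea of a GDRO adversary concrete, here is a minimal sketch of the textbook exponentiated-gradient update over group weights. It is not taken from the paper: the function name `update_group_weights`, the step size, and the use of `1 - pass rate` as the per-group loss are all assumptions chosen for illustration.

```python
import math


def update_group_weights(weights, group_losses, step_size=0.5):
    """One exponentiated-gradient step for a GDRO-style adversary.

    Groups with higher loss receive multiplicatively more weight, and the
    result is renormalized so it stays a probability distribution.
    (Generic sketch; the paper's actual controllers may differ.)
    """
    if len(weights) != len(group_losses):
        raise ValueError("weights and group_losses must have the same length")
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical example: three difficulty groups (easy, medium, hard),
# with the loss of each group taken as 1 - pass rate.
weights = [1 / 3, 1 / 3, 1 / 3]
losses = [0.1, 0.5, 0.9]
for _ in range(5):
    weights = update_group_weights(weights, losses)
```

Under these assumptions, repeated updates shift sampling mass toward the group with the highest loss, which is one way a curriculum favoring harder problems could emerge.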
Findings
Achieved relative gains of +10.6% and +10.1% in pass@8 accuracy over uniform-sampling baselines.
Validated on Qwen3-Base models across multiple scales.
Demonstrated an emergent curriculum shifting focus to harder reasoning tasks.
Abstract
Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose…
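The grouping of prompts by pass@k difficulty can be sketched as follows. This is an illustrative assumption, not the paper's classifier: it uses the standard unbiased pass@k estimator, and the helper names `pass_at_k` and `difficulty_group` as well as the thresholds `(0.25, 0.75)` are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled rollouts, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect rollouts: every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def difficulty_group(score, thresholds=(0.25, 0.75)):
    """Map a pass@k score to a group index (0 = hardest).

    The index is the number of thresholds the score meets or exceeds.
    The threshold values are hypothetical.
    """
    return sum(score >= t for t in thresholds)


# Hypothetical rollout statistics per prompt: (rollouts sampled, correct).
stats = {"prompt_a": (8, 0), "prompt_b": (8, 1), "prompt_c": (8, 6)}
groups = {
    pid: difficulty_group(pass_at_k(n, c, k=4))
    for pid, (n, c) in stats.items()
}
```

Recomputing `groups` as new rollout counts arrive during training would let a prompt move between groups, which is the sense in which such a partition is dynamic.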
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Reinforcement Learning in Robotics
