Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

TL;DR
This paper introduces E2H Reasoner, a curriculum learning approach that schedules tasks from easy to hard to enhance reasoning in small language models, supported by theoretical guarantees and empirical results.
Contribution
The paper proposes a novel curriculum learning method for RL in LLMs, with convergence guarantees and sample complexity analysis, improving reasoning abilities of small models.
Findings
E2H Reasoner improves reasoning in small LLMs.
Fading easy tasks prevents overfitting.
Theoretical convergence guarantees are established.
Abstract
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately…
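The easy-to-hard scheduling described in the abstract can be sketched as a sampling distribution over difficulty levels that moves toward harder levels as training progresses. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `curriculum_weights` and `sample_level`, the linear shift of sampling mass, and the choice of three difficulty levels are all hypothetical.

```python
import random


def curriculum_weights(step, total_steps, num_levels):
    """Sampling probabilities over difficulty levels (index 0 = easiest).

    Assumption: probability mass moves linearly from the easiest level to
    the hardest one, so easy tasks fade out instead of staying in the mix.
    """
    progress = step / max(total_steps - 1, 1)   # 0.0 at the first step
    progress = min(max(progress, 0.0), 1.0)     # clamp out-of-range steps
    center = progress * (num_levels - 1)        # level that is currently in focus
    # Triangular kernel: only the two levels adjacent to `center` get mass.
    raw = [max(0.0, 1.0 - abs(level - center)) for level in range(num_levels)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_level(step, total_steps, num_levels, rng):
    """Draw the difficulty level for the next training task."""
    weights = curriculum_weights(step, total_steps, num_levels)
    return rng.choices(range(num_levels), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 25, 50, 75, 99):
        weights = curriculum_weights(step, total_steps=100, num_levels=3)
        print(step, [round(w, 2) for w in weights], sample_level(step, 100, 3, rng))
```

In this sketch the easiest level receives all of the sampling mass at the first step and none by the final step, which mirrors the abstract's observation that easy tasks matter early but should be faded out. A smoother kernel could be substituted without changing the overall idea.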
Peer Reviews
Decision·ICLR 2026 Poster
1. The method creatively combines task decomposition with probabilistic scheduling in CRL, addressing rollout inefficiencies in difficult reasoning tasks by building skills incrementally, which makes intuitive sense and extends prior RL post-training like DeepSeek-R1.
2. Theoretical analysis provides finite-sample bounds and convergence guarantees, grounding the approach in approximate policy iteration.
3. Well-structured presentation with illustrative figures (e.g., task decomposition in Fig.
1. Risk of Overfitting in Task Decomposition: Decomposing hard tasks into varying difficulty levels may cause repeated exposure to similar knowledge patterns across subtasks, increasing overfitting risks, especially if subtasks overlap significantly without explicit regularization.
2. Lack of Implementation Details for Reproducibility: Key details are missing, such as prompts used for automatic difficulty estimation (e.g., in AQuA/GSM8K) or exact hyperparameters for task grouping, raising conc
The paper proposes a simple method of using curriculum learning. The curriculum implicitly assumes some grouping of tasks, but they also show that the grouping is not necessary because tasks can be clustered just using pass rates of the initial model. They also compare with different baselines and the empirical results seem sound.
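The pass-rate grouping this review mentions can be illustrated with a short sketch. It assumes each task has been attempted several times by the initial model and that tasks are split into equal-sized buckets by pass rate; the helper names `pass_rate` and `group_by_pass_rate`, the bucket count, and the equal-size split are hypothetical and may differ from what the paper does.

```python
def pass_rate(outcomes):
    """Fraction of successful attempts for one task (outcomes are 0/1 or bool)."""
    return sum(outcomes) / len(outcomes)


def group_by_pass_rate(results, num_levels=3):
    """Split tasks into difficulty groups, easiest group first.

    results: dict mapping a task id to the list of attempt outcomes
             produced by the initial model.
    Assumption: groups are equal-sized slices of the ranking by pass rate;
    if there are fewer tasks than levels, fewer groups are returned.
    """
    if not results:
        return []
    # Higher pass rate means easier, so sort in descending order.
    ranked = sorted(results, key=lambda task: pass_rate(results[task]), reverse=True)
    size = -(-len(ranked) // num_levels)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


if __name__ == "__main__":
    attempts = {
        "a": [1, 1, 1, 1],
        "b": [0, 0, 0, 0],
        "c": [1, 0, 1, 0],
        "d": [1, 1, 1, 0],
    }
    print(group_by_pass_rate(attempts, num_levels=2))
```

The resulting groups could then serve as the difficulty levels that a curriculum schedule samples from.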
The only weakness that comes to mind is the lack of a comparison with DAPO [1], which also has an implicit curriculum because the model keeps filtering prompts that are either too easy or too hard. Could the authors compare with DAPO as well and show results on the benchmarks? Also, the paper doesn't cite Paprika [2], which also proposes a curriculum when tasks can be grouped.
[1] DAPO: An Open-Source LLM Reinforcement Learning System at Scale (https://arxiv.org/abs/2503.14476)
[2] Training a Generally
1. The paper provides theoretical justification for why CRL can achieve sample efficiency, requiring fewer total samples than direct learning on the final task.
2. The experimental results are sound and well-presented.
1. The idea of using curriculum learning to improve RL efficiency is not novel. The paper acknowledges prior work (e.g., Chen et al., Foster et al., Bae et al., Zeng et al.) that used curriculum learning ideas. The paper should also cite Yu et al. (DAPO: An Open-Source LLM Reinforcement Learning System at Scale).
2. In the experimental results, E2H does not consistently outperform baselines such as GRPO or Self-Evolve.
3. The paper does not clearly articulate the advantages of E2H over adaptive
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
