Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Shubham Parashar; Shurui Gui; Xiner Li; Hongyi Ling; Sushil Vemuri; Blake Olson; Eric Li; Yu Zhang; James Caverlee; Dileep Kalathil; Shuiwang Ji

arXiv:2506.06632·cs.LG·March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces E2H Reasoner, a curriculum learning approach that schedules tasks from easy to hard to enhance reasoning in small language models, supported by theoretical guarantees and empirical results.

Contribution

The paper proposes a novel curriculum learning method for RL in LLMs, with convergence guarantees and sample complexity analysis, improving reasoning abilities of small models.

Findings

1. E2H Reasoner improves reasoning in small LLMs.
2. Fading easy tasks prevents overfitting.
3. Theoretical convergence guarantees are established.

Abstract

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately…
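To make the idea of scheduling from easy to hard while fading out easy tasks concrete, here is a minimal sketch of one possible sampling schedule. It is not the scheduler from the paper: the function names, the linear progress variable, and the triangular weighting are all assumptions chosen for illustration. It only shows the general shape of a schedule whose probability mass starts on the easiest difficulty level and ends on the hardest.

```python
import random


def e2h_sampling_probs(step, total_steps, num_levels):
    """Hypothetical easy-to-hard schedule (an assumption, not the paper's method).

    Returns one sampling probability per difficulty level, where level 0 is
    the easiest. The probability mass slides linearly from level 0 at the
    first step to the hardest level at the last step, so easy tasks fade out.
    """
    progress = 1.0 if total_steps <= 1 else step / (total_steps - 1)
    # The "center" of the schedule moves from the easiest to the hardest level.
    center = progress * (num_levels - 1)
    # Triangular weights: a level gets weight only if it is within 1 of the center.
    weights = [max(0.0, 1.0 - abs(level - center)) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_level(step, total_steps, num_levels, rng=random):
    """Draw the difficulty level to train on at this step."""
    probs = e2h_sampling_probs(step, total_steps, num_levels)
    return rng.choices(range(num_levels), weights=probs, k=1)[0]


if __name__ == "__main__":
    for step in range(5):
        print(step, e2h_sampling_probs(step, total_steps=5, num_levels=3))
```

With three levels and five steps, the easiest level is sampled exclusively at the first step and never at the last, which is the fading behaviour the abstract describes. How quickly to fade is a design choice this sketch does not address.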

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The method creatively combines task decomposition with probabilistic scheduling in CRL, addressing rollout inefficiencies in difficult reasoning tasks by building skills incrementally, which makes intuitive sense and extends prior RL post-training like DeepSeek-R1.
2. Theoretical analysis provides finite-sample bounds and convergence guarantees, grounding the approach in approximate policy iteration.
3. Well-structured presentation with illustrative figures (e.g., task decomposition in Fig.

Weaknesses

1. Risk of Overfitting in Task Decomposition: Decomposing hard tasks into varying difficulty levels may cause repeated exposure to similar knowledge patterns across subtasks, increasing overfitting risks, especially if subtasks overlap significantly without explicit regularization.
2. Lack of Implementation Details for Reproducibility: Key details are missing, such as prompts used for automatic difficulty estimation (e.g., in AQuA/GSM8K) or exact hyperparameters for task grouping, raising conc

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper proposes a simple method of using curriculum learning. The curriculum implicitly assumes some grouping of tasks, but they also show that the grouping is not necessary because tasks can be clustered just using pass rates of the initial model. They also compare with different baselines and the empirical results seem sound.
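The pass-rate clustering mentioned above can be illustrated with a short sketch. This is an assumption about how such grouping might look, not the authors' procedure: `group_by_pass_rate` is a hypothetical helper, the pass rates are taken as given (for example, estimated by sampling the initial model several times per task), and the equal-sized buckets are a simplification.

```python
def group_by_pass_rate(pass_rates, num_groups):
    """Hypothetical grouping of tasks into difficulty buckets.

    pass_rates maps a task id to the fraction of attempts the initial model
    solved. Tasks are ranked from highest to lowest pass rate and split into
    num_groups buckets of near-equal size, easiest bucket first.
    """
    ranked = sorted(pass_rates, key=lambda task: pass_rates[task], reverse=True)
    base, extra = divmod(len(ranked), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        # The first `extra` buckets absorb one leftover task each.
        size = base + (1 if g < extra else 0)
        groups.append(ranked[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    rates = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3}
    print(group_by_pass_rate(rates, num_groups=3))
```

The resulting buckets could serve as the difficulty levels a curriculum draws from, with no hand-labelled grouping required.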

Weaknesses

The only weakness that comes to mind is not comparing with DAPO [1] which also has an implicit curriculum because the model keeps filtering prompts that are either too easy or too hard. Could the authors compare with DAPO as well and show results on the benchmarks? Also the paper doesn't cite Paprika [2] which also proposes a curriculum when tasks can be grouped.

[1] DAPO: An Open-Source LLM Reinforcement Learning System at Scale (https://arxiv.org/abs/2503.14476)
[2] Training a Generally

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper provides theoretical justification for why CRL can achieve sample efficiency, requiring fewer total samples than direct learning on the final task.
2. The experimental results are sound and well-presented.

Weaknesses

1. The idea of using curriculum learning to improve RL efficiency is not novel. The paper acknowledges prior work (e.g., Chen et al., Foster et al., Bae et al., Zeng et al.) that used curriculum learning ideas. The paper should also cite Yu et al. (DAPO: An Open-Source LLM Reinforcement Learning System at Scale).
2. In the experimental results, E2H does not consistently outperform baselines such as GRPO or Self-Evolve.
3. The paper does not clearly articulate the advantages of E2H over adaptive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques