Self-Evolving Curriculum for LLM Reasoning
Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo

TL;DR
This paper introduces Self-Evolving Curriculum (SEC), an automatic curriculum learning method for RL fine-tuning of LLMs that dynamically selects training problems to enhance reasoning skills and generalization.
Contribution
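SEC formulates curriculum selection as a non-stationary Multi-Armed Bandit problem over problem categories, uses the absolute advantage from policy gradient training as the bandit's reward signal, and updates the curriculum policy with TD(0), improving reasoning performance across planning, inductive reasoning, and mathematics.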
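The bandit loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CurriculumBandit`, the softmax (Boltzmann) sampling rule, the averaging of absolute advantages into a single reward, and all default hyperparameter values are hypothetical choices. The reviews on this page confirm only that the method uses the absolute advantage as reward, a TD(0) update, and a learning rate and temperature.

```python
import math
import random


class CurriculumBandit:
    """Hypothetical sketch of a non-stationary bandit over problem categories."""

    def __init__(self, categories, lr=0.5, temperature=1.0, seed=0):
        # One running value estimate per category (arm), initialised to zero.
        self.q = {c: 0.0 for c in categories}
        self.lr = lr                    # step size of the TD(0) update
        self.temperature = temperature  # softmax temperature (assumed sampling rule)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Numerically stable softmax over the current value estimates.
        m = max(self.q.values())
        weights = {c: math.exp((v - m) / self.temperature) for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self):
        # Draw the category to train on next.
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        # Reward: mean absolute advantage of the rollouts (averaging is an assumption).
        reward = sum(abs(a) for a in advantages) / len(advantages)
        # TD(0)-style update: move the estimate a fraction lr toward the new reward,
        # so older observations decay as the model's abilities shift.
        self.q[category] += self.lr * (reward - self.q[category])
        return reward


# Usage: advantages would come from the RL fine-tuning step on the sampled category.
bandit = CurriculumBandit(["easy", "medium", "hard"], lr=0.5, temperature=0.5)
bandit.update("medium", [0.5, -0.5, 0.5, -0.5])  # mixed outcomes: large |advantage|
bandit.update("easy", [0.0, 0.0])                # always solved: zero advantage
next_category = bandit.sample()
```

Under this sketch, categories where the model sometimes succeeds and sometimes fails produce larger absolute advantages and are sampled more often, while categories that are uniformly solved or uniformly failed fade out of the curriculum.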
Findings
SEC significantly improves reasoning capabilities.
Models generalize better to out-of-distribution problems.
SEC achieves balanced skill development across domains.
Abstract
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The studied problem is important.
* Different tasks (Inductive reasoning, Planning, and Math) are involved in the experiments.
* Both ID (in distribution) and OOD (out of distribution) settings are considered in the experiments.
* This paper is well-written.
* This paper does not compare against many existing curriculum-learning methods for reinforcement learning, despite a growing body of existing works (see references below). The related work section also listed many existing RL curriculum learning methods, but they are not compared in the experiments. The baselines used in the experiments are RFT method without curriculum or that with naive curriculum, which makes it hard to compare the proposed method and other advanced curriculum learning metho
1. Novelty: The formulation of adaptive curriculum learning as a non-stationary MAB problem is novel in the context of LLM RL-finetuning. 2. Strong Empirical Results: The paper provides extensive experiments across three distinct reasoning domains (planning, inductive reasoning, mathematics) and two model scales (3B and 7B parameters). 3. Clarity and Reproducibility: The paper is well-written, and Algorithm 1 provides a clear outline of the method. The authors have also included details on mod
1. While the core method is well-evaluated, a more detailed ablation study would strengthen the paper. For instance, how crucial is the specific choice of the absolute advantage? How do the performance gains compare to the computational cost of maintaining and updating the MAB policy? Furthermore, the hyperparameters for the MAB (learning rate, temperature) are provided but not discussed in terms of their sensitivity or impact on final performance. A brief analysis would be valuable. 2. The con
- The paper is well-written and easy to follow.
- It tackles an important problem in fine-tuning reasoning models.
- The proposed solution is practical and has potential real-world impact.
The experimental comparison is limited to simple baselines (random and easy-to-hard curricula). However, several zone of proximal development (ZPD) and self-paced learning based curriculum RL methods (Florensa et al., 2018; Klink et al., 2020; Eimer et al., 2021; Tzannetos et al., 2023) exist with comparable computational cost to SEC. These approaches are also not discussed in the related work section. In particular, ProCuRL (Tzannetos et al., 2023) provides a simple baseline. It applies to bot
1. The motivation to frame the curriculum as a non-stationary MAB is intuitive. Using the absolute advantage as a reward signal is a natural choice. 2. The experimental setup is thorough. The authors evaluate SEC across three distinct reasoning domains, using two different model sizes. Further ablations showing the method's effectiveness with different RL algorithms, and with automatically inferred curriculum categories strengthen the paper's claims. 3. The results show the benefits of SEC, e
1. The paper notes that the performance gap between the SEC and a random curriculum narrows for the larger Qwen2.5-7B model on several tasks. This raises an important question about the scalability of the approach's benefits. As foundation models become increasingly capable, the inherent need for a carefully curated curriculum might decrease, potentially limiting the long-term impact of this method. 2. The paper employs a TD(0) update to handle the non-stationary nature of the MAB problem. Wh
Taxonomy
Topics: Digital Rights Management and Security · Open Education and E-Learning · Information Systems Education and Curriculum Development
