TL;DR
This paper introduces a simple two-stage reinforcement learning curriculum that enhances large language models' reasoning abilities across multiple domains by first focusing on math and then transferring skills to other areas.
Contribution
It presents a minimal, backbone-agnostic curriculum that effectively improves reasoning in LLMs without requiring specialized reward models.
Findings
Consistent reasoning improvements across multiple models and domains
Both curriculum stages are necessary for optimal performance
Math-first approach enhances complex problem-solving skills
Abstract
Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation…
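The two-stage schedule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the step counts, the domain names `code` and `stem`, the use of SFT for the cold start, the inclusion of math in the Stage 2 mix, and the exact-match reward are all assumptions made for the example, and `Phase`, `reasoning_curriculum`, `verifiable_reward`, and `schedule` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Phase:
    name: str                 # label for the curriculum phase
    method: str               # "sft" or "rl" (assumed labels)
    domains: Tuple[str, ...]  # domains sampled during this phase
    steps: int                # number of training steps (illustrative)


def reasoning_curriculum(
    cold_start_steps: int = 2,
    math_rl_steps: int = 4,
    joint_rl_steps: int = 4,
    other_domains: Tuple[str, ...] = ("code", "stem"),  # assumed domain names
) -> List[Phase]:
    """Stage 1: brief cold start, then math-only RL. Stage 2: joint RL on mixed domains."""
    return [
        Phase("stage1_cold_start", "sft", ("math",), cold_start_steps),
        Phase("stage1_math_rl", "rl", ("math",), math_rl_steps),
        # Assumption: the mixed-domain data in Stage 2 still contains math.
        Phase("stage2_joint_rl", "rl", ("math",) + tuple(other_domains), joint_rl_steps),
    ]


def verifiable_reward(prediction: str, reference: str) -> float:
    """Binary reward from a simple verifiability check (exact match after normalization)."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def schedule(phases: List[Phase], seed: int = 0) -> Iterator[Tuple[str, str, str]]:
    """Yield (phase name, method, sampled domain) for every training step in order."""
    rng = random.Random(seed)
    for phase in phases:
        for _ in range(phase.steps):
            yield phase.name, phase.method, rng.choice(phase.domains)


if __name__ == "__main__":
    for step, (name, method, domain) in enumerate(schedule(reasoning_curriculum())):
        print(step, name, method, domain)
```

The point of the sketch is the ordering: every Stage 1 step draws only from math, and other domains enter only once joint RL begins. A real pipeline would replace the exact-match check with domain-specific verifiers (for example, unit tests for code).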
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The paper tries to extend the capability learned from math tasks to other domains, which is a topic worth studying for enhancing reasoning in LLMs. The paper uses four kinds of skills to track the cognitive capabilities that assist reasoning, and it also provides quantitative results on them. * Experiments show that some of the cognitive patterns can be transferred from math tasks to other domains such as STEM and coding, using the proposed Reasoning Curriculum method. Benchmarking on differ
* The conclusion of this paper is not new. The method lacks a significant contribution for an ICLR submission. The so-called "curriculum" seems to be simply math (SFT + RL) followed by all-domain data. There are already a few discussions of how math reasoning capability can be transferred to other or more general domains, e.g., [1,2]. * Unfair comparison. Guru uses Qwen2.5 series models (other baselines in Table 1 also use inferior base models compared with the authors') as the backbone and this paper use
The authors study a timely and interesting topic. The prose is well written but vague, and it does not seem to contextualize its results well against similar work (see Weaknesses below).
The authors are vague on the details of their evaluation and comparison with similar projects (GURU, SimpleRL, etc.). * It's unclear how they ran their evaluations. What inference settings did they use specifically? * Did they re-run baselines from other projects using their own setup, or copy numbers from the other papers? * If they re-ran baselines, how did they do so exactly? Did they use their own grading/inference rig or standardize evaluations in some way across all baseline models? Th
1. The recipe is simple and practical; 2. Strong and broad empirical results across domains; 3. Works on both Qwen and Llama.
1. Limited comparison with strong baselines; 2. Missing head-to-head evaluations against recent strong open-source RL pipelines; 3. Benchmarks are mostly verifiable tasks; free-form reasoning is not evaluated.
- The proposed method is intuitive, and the motivation is well-introduced. As math is an amenable domain for RL, it makes sense that priming the model with these skills facilitates future broader reasoning. - Overall, the paper is well-written and free of typos. The methodology and the past techniques used are clearly presented. - The domain of RL reasoning is a relevant topic for language model research and the ICLR community.
**Main** 1. The choice of baselines seems very questionable to me, and statements such as the method "sometimes exceeds, 32B systems" are never properly contextualized: the model is only compared with baselines from the older Qwen2.5 model family, which is significantly weaker than the Qwen3 model family. Once again, the authors use Qwen3-4B as their Qwen model choice, not Qwen2. Yet, results for even baselines comparing the authors' model to just the performance of plain Qwen 3 4B, Qwen3 7B, a
