Train Long, Think Short: Curriculum Learning for Efficient Reasoning

Hasan Abed Al Kader Hammoud; Kumail Alhamoud; Abed Hammoud; Elie Bou-Zeid; Marzyeh Ghassemi; Bernard Ghanem

arXiv:2508.08940·cs.CL·August 13, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a curriculum learning approach for length-controlled reasoning in large language models, gradually tightening token budgets during training to improve accuracy and efficiency.

Contribution

The paper proposes a curriculum learning strategy, built on GRPO, that progressively tightens the reasoning-length budget over training instead of holding it fixed, and reports better performance than fixed-budget methods.

Findings

01 · Outperforms fixed-budget baselines in accuracy and token efficiency.

02 · Gradual tightening of reasoning length acts as an effective inductive bias.

03 · Reward balancing improves training effectiveness.
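
To make the "reward balancing" finding concrete, here is a minimal sketch of a reward that mixes the three signals named in the abstract (correctness, length efficiency, formatting). Everything specific in it is an assumption for illustration and is not taken from the paper: the weights `W_CORRECT`, `W_LENGTH`, `W_FORMAT`, the linear length penalty, and the `<think>`/`<answer>` tag names.

```python
import re

# Hypothetical weights; the paper's actual values are not given on this page.
W_CORRECT, W_LENGTH, W_FORMAT = 1.0, 0.5, 0.2


def length_reward(num_tokens: int, budget: int) -> float:
    """Assumed shape: 1.0 within budget, falling linearly to 0.0 at twice the budget."""
    if num_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_tokens - budget) / budget)


def format_reward(text: str) -> float:
    """1.0 if the output is <think>...</think> followed by <answer>...</answer> (assumed tags)."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0


def total_reward(text: str, num_tokens: int, budget: int, is_correct: bool) -> float:
    """Weighted sum of the three signals; `is_correct` would come from a verifier."""
    return (
        W_CORRECT * float(is_correct)
        + W_LENGTH * length_reward(num_tokens, budget)
        + W_FORMAT * format_reward(text)
    )
```

With weights like these, a correct but over-long answer still scores higher than a short wrong one, which is one plausible way to keep the length signal from overwhelming correctness.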

Abstract

Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags).…
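
The "generous budget first, then tighten" idea can be sketched as a step-wise decaying token budget. This is only an illustration under stated assumptions: the function name `token_budget`, the exponential form, and the `floor`, `decay`, and `interval` values are hypothetical. The starting value of 256 follows the budget cap mentioned in the reviews on this page.

```python
def token_budget(
    step: int,
    start: int = 256,      # initial budget (reviews on this page mention a 256-token cap)
    floor: int = 64,       # hypothetical final budget
    decay: float = 0.9,    # hypothetical multiplicative decay factor
    interval: int = 100,   # hypothetical number of training steps between decays
) -> int:
    """Step-wise exponential decay: shrink the budget every `interval` steps, never below `floor`."""
    n_decays = step // interval
    return max(floor, int(start * decay ** n_decays))


if __name__ == "__main__":
    # Print the budget at a few training steps to show the tightening schedule.
    for s in (0, 100, 500, 1000, 5000):
        print(s, token_budget(s))
```

During training, the current `token_budget(step)` would be the budget passed to the length term of the reward, so early rollouts can explore long solutions and later rollouts are pushed toward concise ones.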

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper presents a novel perspective on efficient reasoning training by framing it as a curriculum learning problem, which aligns intuitively with how models should progress from exploration to compression. The experimental evaluation is thorough and systematic, with consistent improvements across multiple benchmarks and comprehensive ablation studies examining reward weighting, decay schedules, reward shapes, and decay intervals. The clear visualization of the approach and detailed methodolog

Weaknesses

The study is significantly limited by its scope. All experiments use only QWEN-2.5-7B, leaving open questions about scaling behavior across model sizes, and the token budget is capped at 256, which may not reflect realistic reasoning scenarios requiring longer chains. The training datasets are relatively modest, and while improvements are consistent, they are sometimes modest, raising questions about practical significance. Additionally, the method introduces multiple hyperparameters requiring t

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The experiments address six key questions, which are helpful in understanding the improvement.
2. The paper focuses on an important question.
3. The paper proposes a simple but useful method.

Weaknesses

1. The experiments are conducted only with QWEN-2.5-7B on math reasoning tasks. It would be helpful to show results on different model sizes and model families.
2. The user still needs to set a token budget, but it is hard for users to know the ideal budget for each dataset. For example, hard questions need more tokens while easy questions need fewer.
3. Some important baselines appear to be missing. [1] also targets the best trade-off across different lengths. [1,2] trai

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- This paper proposes a simple curriculum method for enabling efficient reasoning.
- The paper is easy to read and follow.
- Extensive results show that at least it is better than using a fixed budget with a shorter budget.

Weaknesses

- The training loss is the same as in previous papers, and the only difference is the curriculum. But is it really a more effective strategy than a fixed-budget optimization like LCPO? That is highly doubtful. Novelty is also an issue here.
- A related problem is the baselines, which are very weak. The paper does not compare against performance-oriented fine-tuning such as pure GRPO, so we don't know how it trades off performance and efficiency. Also, a lot of efficient fine-tuning meth

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques