Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2

Xiang Fu

arXiv:2505.11643 · cs.CL · May 20, 2025

Open Access

TL;DR

This paper shows that a four-stage, easy-to-hard curriculum improves reasoning transparency and sample efficiency in a small language model (a 124M-parameter GPT-2) by training it on progressively more complex reasoning tasks.

Contribution

It introduces a four-stage curriculum, ascending from lexical matching to multi-step symbolic inference, that improves the sample efficiency and reasoning transparency of a small language model, which is then evaluated without task-specific fine-tuning.

Findings

01. Reaches target accuracy in half the optimization steps of a single-phase baseline

02. Activates an order of magnitude more gradient-salient reasoning heads

03. Shifts those heads toward deeper layers, yielding higher-entropy attention
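To make "higher-entropy attention" concrete, here is a minimal sketch of the Shannon entropy of a single attention distribution. It is an illustration only: the function name `attention_entropy` and the two example weight vectors are hypothetical, and the paper's exact entropy measurement is not described on this page.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` is a sequence of non-negative attention weights for a single
    query position; it is renormalized here in case it does not sum to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical example distributions over four key positions.
peaked = [0.97, 0.01, 0.01, 0.01]    # attends almost entirely to one token
uniform = [0.25, 0.25, 0.25, 0.25]   # spreads attention evenly

print(attention_entropy(peaked))
print(attention_entropy(uniform))
```

A head that concentrates on one token has entropy near zero, while a head that spreads attention evenly over n tokens reaches the maximum, ln(n). Higher entropy therefore indicates attention distributed across more of the context.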

Abstract

We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference, and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order of magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression, not extra compute, drives the effect. We also identify open challenges: final-answer success still lags a…
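The ordering idea in the abstract can be sketched as a stage schedule indexed by one global step counter that is never reset between stages. This is a hedged illustration, not the paper's training code. The `Stage` class, the `stage_for_step` helper, and the per-stage step budgets are hypothetical. The abstract names only the first and last stages, so the two intermediate stages are left as placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int  # hypothetical step budget, not taken from the paper


def stage_for_step(stages, global_step):
    """Return the stage active at a given global optimization step.

    The counter runs across all stages without resetting, so anything
    keyed on it (for example a learning-rate schedule) continues smoothly
    from one stage to the next.
    """
    boundary = 0
    for stage in stages:
        boundary += stage.steps
        if global_step < boundary:
            return stage
    raise ValueError("global_step is past the end of the curriculum")


# Easy-to-hard order. The intermediate stage names are placeholders because
# the abstract does not say what stages 2 and 3 contain.
CURRICULUM = [
    Stage("stage_1_lexical_matching", 100),
    Stage("stage_2_unspecified", 100),
    Stage("stage_3_unspecified", 100),
    Stage("stage_4_multi_step_symbolic_inference", 100),
]

for step in (0, 150, 399):
    print(step, stage_for_step(CURRICULUM, step).name)
```

In a real training loop, the returned stage would select which dataset to sample from, while a single optimizer instance would be reused throughout. The ablations mentioned in the abstract correspond to shuffling `CURRICULUM` (out-of-order) or re-creating the optimizer at each stage boundary (optimizer resets).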

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Language Development and Disorders

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Softmax · Attention Dropout · Residual Connection · Linear Layer · Byte Pair Encoding