Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2
Xiang Fu

TL;DR
This paper shows that a carefully designed curriculum can significantly improve reasoning abilities and training efficiency in small language models like GPT-2, by progressing through increasingly complex reasoning tasks.
Contribution
It introduces a four-stage curriculum for training small language models that enhances reasoning, efficiency, and interpretability without task-specific fine-tuning.
Findings
Reaches target accuracy in half the optimization steps of a single-phase baseline
Activates an order of magnitude more gradient-salient reasoning heads
Shifts those heads toward deeper layers, yielding higher-entropy attention
Abstract
We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample-efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124 M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order-of-magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression--not extra compute--drives the effect. We also identify open challenges: final-answer success still lags a…
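The abstract reports "higher-entropy attention" without stating here how that entropy is measured. One common way to quantify it, offered only as a minimal sketch and not as the paper's actual metric, is the Shannon entropy of each attention row averaged over query positions. The function names `attention_entropy` and `mean_head_entropy` and the toy attention rows are hypothetical and chosen for illustration.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row.

    `weights` holds non-negative attention scores over key positions;
    they are renormalized so the row sums to 1 before computing entropy.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing (limit of p*log p is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_head_entropy(attn_rows):
    """Average row entropy for one head (one row per query position)."""
    return sum(attention_entropy(row) for row in attn_rows) / len(attn_rows)


# Hypothetical toy heads: one attends almost entirely to a single token,
# the other spreads attention evenly across four tokens.
peaked_head = [[0.97, 0.01, 0.01, 0.01]]
spread_head = [[0.25, 0.25, 0.25, 0.25]]

peaked_entropy = mean_head_entropy(peaked_head)
spread_entropy = mean_head_entropy(spread_head)
```

Under this assumed measure, a head that mixes local and long-range context scores closer to the uniform bound of log(n) for n key positions, while a head locked onto one token scores near zero.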
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Language Development and Disorders
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Softmax · Attention Dropout · Residual Connection · Linear Layer · Byte Pair Encoding
