TL;DR
This paper shows that training small language models (SLMs; <=3B parameters) on limited long chain-of-thought (CoT) data can significantly degrade their performance, attributes the effect to error accumulation, and offers insights for building better small-scale reasoning models.
Contribution
It identifies and analyzes Long CoT Degradation in small language models, providing empirical evidence and practical guidance to mitigate this issue.
Findings
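Long CoT Degradation is widespread across SLMs, appearing in the Qwen2.5, LLaMA3 and Gemma3 families.
In some settings, training on only 8k long CoT examples costs a model up to 75% of its pre-fine-tuning performance.
Sufficiently scaled fine-tuning can alleviate the degradation, although for some particularly small models even 220k long CoT examples fail to recover the original performance.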
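To make the "up to 75%" figure concrete, the sketch below computes a relative performance loss. It assumes the loss is measured relative to the model's score before fine-tuning, which is how the abstract's wording reads. The function name `relative_loss` and the scores used are hypothetical illustrations, not numbers from the paper.

```python
def relative_loss(before: float, after: float) -> float:
    """Fraction of the pre-fine-tuning score lost after fine-tuning.

    Assumption: 'performance loss' means (before - after) / before.
    A negative result would mean the model improved.
    """
    if before <= 0:
        raise ValueError("pre-fine-tuning score must be positive")
    return (before - after) / before


# Hypothetical scores, chosen only to illustrate a 75% relative loss.
score_before = 40.0  # benchmark accuracy (%) before long CoT fine-tuning
score_after = 10.0   # benchmark accuracy (%) after fine-tuning on limited data

loss = relative_loss(score_before, score_after)
print(f"relative loss: {loss:.0%}")
```

Under this reading, a 75% loss means the model keeps only a quarter of its original score, not that its accuracy falls by 75 percentage points.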
Abstract
Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for…