Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

Xinghao Chen; Zhijing Sun; Wenjin Guo; Miaoran Zhang; Yanjun Chen; Yirong Sun; Hui Su; Yijie Pan; Dietrich Klakow; Wenjie Li; Xiaoyu Shen

arXiv:2502.18001 · cs.CL · May 28, 2025

TL;DR

This paper investigates how granularity, format, and teacher model choice affect the distillation of Chain-of-Thought reasoning from large to small language models, and offers insights for optimizing that process.

Contribution

The paper systematically analyzes the key factors influencing CoT distillation, showing how model strength and supervision choices (granularity, format, and teacher model) affect the performance of small language models.

Findings

01 SLMs exhibit a non-monotonic relationship with granularity: stronger models benefit from finer-grained reasoning, while weaker models perform better with simpler CoT supervision.

02 CoT format significantly impacts LLMs but has minimal effect on SLMs.

03 Stronger teacher models do not always produce better student models, owing to the diversity and complexity of their CoT supervision.

Abstract

Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised…
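As a rough illustration of what varying granularity in CoT supervision could look like, the sketch below builds a fine-tuning target from a teacher rationale at a chosen number of steps. It is a minimal sketch under stated assumptions, not the paper's pipeline: the `TeacherSample` structure, the `merge_steps` and `build_sft_example` helpers, and the idea of modelling granularity as the number of step chunks are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    """One teacher-generated example (hypothetical structure, not from the paper)."""
    question: str
    steps: list[str]  # reasoning steps written by the teacher model
    answer: str


def merge_steps(steps: list[str], granularity: int) -> list[str]:
    """Regroup steps into at most `granularity` contiguous chunks.

    Assumption: granularity is modelled as the number of reasoning steps
    shown to the student. Smaller values give coarser supervision.
    """
    if granularity <= 0 or not steps:
        return []
    n = min(granularity, len(steps))
    size, extra = divmod(len(steps), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional step each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(" ".join(steps[start:end]))
        start = end
    return chunks


def build_sft_example(sample: TeacherSample, granularity: int) -> dict[str, str]:
    """Format a prompt/completion pair for supervised fine-tuning of a student."""
    rationale = merge_steps(sample.steps, granularity)
    lines = [f"Step {i}: {text}" for i, text in enumerate(rationale, start=1)]
    lines.append(f"Answer: {sample.answer}")
    return {
        "prompt": f"Question: {sample.question}\n",
        "completion": "\n".join(lines),
    }


sample = TeacherSample(
    question="Tom has 3 boxes of 4 apples and eats 2. How many are left?",
    steps=[
        "There are 3 boxes.",
        "Each box has 4 apples, so 3 * 4 = 12 apples.",
        "Tom eats 2 apples.",
        "12 - 2 = 10 apples remain.",
    ],
    answer="10",
)
coarse = build_sft_example(sample, granularity=2)
fine = build_sft_example(sample, granularity=4)
```

Under this framing, sweeping `granularity` over the same teacher rationale yields a family of training sets whose effect on a given student can then be compared.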

Code & Models

Repositories

eit-nlp/distilling-cot-reasoning
PyTorch · Official

Taxonomy

Topics: Advanced Text Analysis Techniques

Methods: Chain-of-thought prompting