Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective
Tokio Kajitsuka, Ukyo Honda, Sho Takase

TL;DR
This paper re-examines the capacity gap in Chain-of-Thought (CoT) distillation. It finds that distillation often leaves the student below its own pre-distillation baseline, and that capacity gap effects do not consistently dominate across tasks and settings. These results give practical guidance for selecting teacher-student pairs.
Contribution
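It proposes a more realistic evaluation protocol, motivated by the observation that post-distillation-only comparisons hide degradation relative to the student's baseline. It shows that capacity gap effects are inconsistent across tasks and settings, challenging prior assumptions about CoT distillation.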
Findings
Distillation often degrades performance compared to the student's baseline.
The impact of the capacity gap varies across tasks and settings.
Capacity gap effects do not consistently dominate, especially when candidate teachers differ substantially in performance.
Abstract
Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
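The abstract's point about baselines can be illustrated with a small sketch. This is not the authors' protocol, which is not detailed on this page; it only shows the general idea of scoring each candidate teacher by the student's change relative to its own pre-distillation accuracy, rather than by post-distillation accuracy alone. The function names (`distillation_gains`, `pick_teacher`), the teacher labels, and all accuracy values are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Optional, Tuple


def distillation_gains(
    baseline_acc: float, distilled_acc: Dict[str, float]
) -> Dict[str, float]:
    """Student accuracy change per teacher, relative to the pre-distillation baseline."""
    return {teacher: acc - baseline_acc for teacher, acc in distilled_acc.items()}


def pick_teacher(
    baseline_acc: float, distilled_acc: Dict[str, float]
) -> Optional[Tuple[str, float]]:
    """Return (teacher, gain) for the best teacher, or None if no teacher helps.

    A post-distillation-only comparison would always return some teacher;
    comparing against the baseline allows the answer "do not distill".
    """
    if not distilled_acc:
        return None
    gains = distillation_gains(baseline_acc, distilled_acc)
    best = max(gains, key=gains.get)
    if gains[best] <= 0:
        return None  # every candidate degrades (or fails to improve) the student
    return best, gains[best]


# Hypothetical numbers, not results from the paper.
baseline = 0.50
after_distillation = {"small_teacher": 0.46, "large_teacher": 0.48}

# Ranking by post-distillation accuracy alone would favor "large_teacher",
# yet both candidates leave the student below its baseline.
choice = pick_teacher(baseline, after_distillation)
print("no teacher improves on the baseline" if choice is None else choice[0])
```

In this made-up case the ranking among teachers is well defined, but the baseline-relative view shows that neither candidate is worth using, which is the kind of degradation the abstract says is obscured when only post-distillation comparisons are reported.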
