On the Generalization vs Fidelity Paradox in Knowledge Distillation
Suhas Kamasetty Ramesh, Ayan Sengupta, Tanmoy Chakraborty

TL;DR
This paper provides a large-scale empirical analysis of knowledge distillation in language models, revealing its benefits, limitations, and the complex relationship between performance and reasoning fidelity across different model sizes.
Contribution
The paper presents the first comprehensive study of how knowledge distillation affects language models of different sizes across reasoning tasks, and it identifies the key factors that influence its effectiveness.
Findings
KD improves smaller models' average performance by up to 10%
Teacher performance has minimal impact on student outcomes
KD benefits diminish as model size increases
Abstract
Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with larger gains on individual tasks, while providing only marginal benefits for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from…
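For readers unfamiliar with the technique, the following is a minimal sketch of the classic soft-label distillation objective: a temperature-scaled KL divergence between the teacher's and the student's output distributions. This is an illustration only. The abstract does not say which KD objectives the paper evaluates, so the formulation, the function names `softmax` and `kd_loss`, and the default temperature are assumptions rather than the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student) for one prediction.

    Hypothetical helper: the classic soft-label formulation, assumed here
    for illustration and not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(p_teacher, p_student)
        if p > 0.0  # terms with zero teacher mass contribute nothing
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    student = [2.5, 1.5, 0.2]
    print(kd_loss(student, teacher))  # positive: the distributions differ
    print(kd_loss(teacher, teacher))  # zero: the student matches the teacher
```

In practice this term is usually averaged over tokens and combined with the ordinary cross-entropy loss on ground-truth labels. The sketch assumes moderate logit values; extreme logits could drive a student probability to zero and would need clamping.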
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
