On the Generalization vs Fidelity Paradox in Knowledge Distillation

Suhas Kamasetty Ramesh; Ayan Sengupta; Tanmoy Chakraborty

arXiv:2505.15442 · cs.CL · August 5, 2025

TL;DR

This paper provides a large-scale empirical analysis of knowledge distillation in language models, revealing its benefits, limitations, and the complex relationship between performance and reasoning fidelity across different model sizes.

Contribution

The authors present what they describe as the first large-scale empirical and statistical analysis of how knowledge distillation affects language models from 0.5B to 7B parameters across 14 reasoning tasks, highlighting the key factors that influence its effectiveness.

Findings

01

KD improves smaller models' average performance by up to 10%

02

Teacher performance has minimal impact on student outcomes, though teacher task expertise does affect KD effectiveness

03

KD benefits diminish as model size increases

Abstract

Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with a peak task specific gain of 22%, while providing only marginal benefits (~1.3%) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from…
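
For readers unfamiliar with the technique, the following is a minimal sketch of the classic logit-matching distillation objective: a temperature-softened KL divergence between teacher and student output distributions, scaled by the squared temperature. It is an illustration only and is not taken from the paper or its repository; the function names `log_softmax` and `kd_loss`, the default temperature, and the example logits are all assumptions made here.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times T^2.

    Hypothetical helper for illustration; the T^2 factor is the usual
    convention for keeping gradient magnitudes comparable across
    temperatures, not something stated in the abstract above.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


if __name__ == "__main__":
    # Made-up logits over a three-token vocabulary.
    teacher = [0.0, 1.0, 3.0]
    student = [3.0, 1.0, 0.0]
    print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice this term is commonly combined with a standard cross-entropy loss on ground-truth labels.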

Code & Models

Repositories

LCS2-IIITD/KD_generalization
PyTorch · Official

Videos

On the Generalization vs Fidelity Paradox in Knowledge Distillation · Underline

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)