Knowledge Distillation Beyond Model Compression
Fahad Sarfraz, Elahe Arani, Bahram Zonooz

TL;DR
This paper analyzes nine knowledge distillation (KD) methods and shows that they are versatile and robust across datasets, architectures, and challenges such as label noise and class imbalance. The results highlight KD's potential beyond model compression.
Contribution
The study compares a diverse set of KD approaches, offers insights into their effectiveness, and advocates for KD as a general training paradigm rather than just a compression technique.
Findings
KD methods improve generalization over standard training.
KD is effective under label noise and class imbalance.
Insights guide the design of more effective KD techniques.
Abstract
Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various techniques have been proposed since the original formulation, which mimic different aspects of the teacher such as the representation space, decision boundary, or intra-data relationship. Some methods replace the one-way knowledge distillation from a static teacher with collaborative learning between a cohort of students. Despite the recent advances, where knowledge resides in a deep neural network, and how best to capture it from the teacher and transfer it to the student, remain open questions. In this study, we provide an extensive study of nine different KD methods, which cover a broad spectrum of approaches to…
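To make the teacher-student setup in the abstract concrete, here is a minimal sketch of a soft-target distillation loss in the style usually attributed to the original KD formulation. It is an illustration under stated assumptions, not the paper's implementation: the function names `softmax` and `kd_loss`, the temperature of 4.0, and the mixing weight `alpha` of 0.9 are all hypothetical choices, and none of the nine methods studied in the paper is reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.9):
    """Weighted sum of a soft-target term and a hard-label term.

    Assumed form (hypothetical hyperparameters):
      alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Standard cross-entropy of the student against the ground-truth label
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Example: a student that disagrees with the teacher is penalized more
# than one that matches the teacher's logits exactly.
teacher = [0.0, 1.0, 3.0]
matching = kd_loss([0.0, 1.0, 3.0], teacher, label=2)
mismatched = kd_loss([3.0, 1.0, 0.0], teacher, label=2)
```

The soft-target term vanishes when the student reproduces the teacher's softened distribution, so with `alpha=1.0` a perfectly matching student has zero loss. The `T^2` factor is a common convention for keeping the gradient scale of the soft term comparable across temperatures.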
Taxonomy
Methods: Knowledge Distillation
