On the Efficacy of Knowledge Distillation
Jang Hyun Cho, Bharath Hariharan

TL;DR
This paper critically evaluates knowledge distillation, revealing that larger models do not always serve as better teachers due to capacity mismatches, and proposes early stopping as a mitigation strategy.
Contribution
The paper analyzes how teacher size, student capacity, and teacher training length affect the effectiveness of knowledge distillation.
Findings
Larger, more accurate models often do not make better teachers than smaller ones.
Sequential distillation steps generally fail to close the capacity gap between a large teacher and a small student.
Early stopping of teacher training improves distillation outcomes.
Abstract
In this paper, we present a thorough evaluation of the efficacy of knowledge distillation and its dependence on student and teacher architectures. Starting with the observation that more accurate teachers often don't make good teachers, we attempt to tease apart the factors that affect knowledge distillation performance. We find crucially that larger models do not often make better teachers. We show that this is a consequence of mismatched capacity, and that small students are unable to mimic large teachers. We find typical ways of circumventing this (such as performing a sequence of knowledge distillation steps) to be ineffective. Finally, we show that this effect can be mitigated by stopping the teacher's training early. Our results generalize across datasets and models.
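For readers unfamiliar with the objective being evaluated, the following is a minimal sketch of the commonly used temperature-scaled distillation loss: a weighted sum of cross-entropy on the true label and the KL divergence between softened teacher and student distributions. It is an illustration, not the authors' code. The function names and the default values of `temperature` and `alpha` are assumptions chosen for the example and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of hard-label cross-entropy and soft-label KL divergence.

    temperature and alpha are illustrative defaults (assumptions),
    not values reported in the paper.
    """
    # Cross-entropy of the student against the ground-truth label (T = 1).
    p_student = softmax(student_logits)
    cross_entropy = -math.log(p_student[label])

    # KL(teacher || student) on distributions softened by the temperature.
    p_teacher_soft = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher_soft, p_student_soft) if t > 0)

    # The T^2 factor keeps the soft-target gradient scale comparable
    # to the hard-label term as the temperature changes.
    return (1 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


teacher = [5.0, 1.0, -2.0]
student = [2.0, 1.5, 0.0]
loss = distillation_loss(student, teacher, label=0)
```

Under this formulation, the early-stopping mitigation described in the abstract changes only which teacher checkpoint produces `teacher_logits`; the loss itself is unchanged.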
Taxonomy
Methods: Knowledge Distillation
