Towards Understanding Knowledge Distillation
Mary Phuong, Christoph H. Lampert

TL;DR
This paper gives a first theoretical account of knowledge distillation for linear and deep linear classifiers, showing how data geometry, the optimization bias of gradient descent, and monotonicity of the expected risk in the training set size determine how quickly and how well distillation succeeds.
Contribution
It proves a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier, and extracts from the bound and its proof three key factors that determine whether distillation succeeds.
Findings
Data geometry influences convergence speed.
Gradient descent finds favorable minima in distillation.
Expected risk decreases with larger training sets.
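As a rough illustration of the last finding, the following sketch measures how often a distilled linear student disagrees with its teacher as the training set grows. It is not the paper's experiment: the Gaussian data, the teacher `w_teacher`, the helper names `distill` and `transfer_risk`, the learning rate, and the step count are all assumptions made for this example, and a single random draw only hints at a statement about expected risk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill(X, w_teacher, lr=0.2, steps=5000):
    # Student starts at zero and runs gradient descent on the cross-entropy
    # against the teacher's soft labels sigmoid(X @ w_teacher).
    soft = sigmoid(X @ w_teacher)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - soft) / len(X)
    return w

def transfer_risk(w_student, w_teacher, X_test):
    # Fraction of test points on which student and teacher predict different classes.
    return float(np.mean((X_test @ w_student > 0) != (X_test @ w_teacher > 0)))

rng = np.random.default_rng(0)
d = 20
w_teacher = rng.normal(size=d) / np.sqrt(d)   # hypothetical linear teacher
X_pool = rng.normal(size=(80, d))             # nested training sets are prefixes of this pool
X_test = rng.normal(size=(5000, d))           # held-out points for estimating disagreement

risks = {n: transfer_risk(distill(X_pool[:n], w_teacher), w_teacher, X_test)
         for n in (5, 10, 40, 80)}
for n, r in risks.items():
    print(f"n={n:3d}  transfer risk={r:.3f}")
```

With far fewer samples than dimensions the student can only recover part of the teacher, so the disagreement rate should be substantial. Once the sample count exceeds the dimension, the soft labels pin down the teacher's weights and the disagreement should be close to zero.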
Abstract
Knowledge distillation, i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably if trained with the outputs of another classifier as soft labels, instead of from ground truth data. So far, however, there is no satisfactory theoretical explanation of this phenomenon. In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. Specifically, we prove a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier. From the bound and its proof we extract three key factors that determine the success of distillation: * data geometry -- geometric properties of…
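The linear setting described in the abstract can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the paper's code: the teacher `w_teacher`, the synthetic Gaussian inputs, the learning rate, and the step count are all hypothetical. It trains a linear student on the teacher's soft labels with plain gradient descent from zero, in a regime with fewer samples than dimensions. Each gradient is a linear combination of training inputs, so the student never leaves their span; the printout checks this and compares the student with the teacher's weights projected onto that span.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 10, 50                                  # fewer samples than dimensions
w_teacher = rng.normal(size=d) / np.sqrt(d)    # hypothetical fixed linear teacher
X = rng.normal(size=(n, d))                    # synthetic training inputs
soft_labels = sigmoid(X @ w_teacher)           # teacher outputs used as targets

# Student: same linear model, zero initialization, plain gradient descent
# on the cross-entropy between teacher and student probabilities.
w_student = np.zeros(d)
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w_student) - soft_labels) / n
    w_student -= 0.2 * grad

# Orthogonal projector onto the span of the training inputs (row space of X).
P = np.linalg.pinv(X) @ X

print("outside-span norm:", np.linalg.norm(w_student - P @ w_student))
print("distance to projected teacher:", np.linalg.norm(w_student - P @ w_teacher))
print("max soft-label gap:", np.abs(sigmoid(X @ w_student) - soft_labels).max())
```

In this sketch the student matches the teacher's soft labels on the training inputs while recovering only the component of the teacher that lies in the data span, which is one concrete way gradient descent can single out a particular minimum among many that fit the soft labels equally well.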
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Machine Learning and ELM
