Knowledge Distillation: Bad Models Can Be Good Role Models
Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

TL;DR
This paper explores how overparameterized neural networks that fit label noise, and are therefore poor classifiers, can still serve as effective teachers in knowledge distillation, yielding students that approximate the Bayes optimal classifier.
Contribution
It provides a theoretical framework linking conditional samplers to knowledge distillation, showing that models which are poor classifiers can still be good teachers.
Findings
Samplers can be good teachers despite poor classification performance.
Distillation from samplers guarantees approximation of the Bayes optimal classifier.
Overparameterized algorithms like Nearest-Neighbours can generate samplers.
Abstract
Large neural networks trained in the overparameterized regime are able to fit noise to zero train error. Recent work (Nakkiran and Bansal, 2020) has empirically observed that such networks behave as "conditional samplers" from the noisy distribution. That is, they replicate the noise in the train data to unseen examples. We give a theoretical framework for studying this conditional sampling behavior in the context of learning theory. We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data. We show that samplers, while being bad classifiers, can be good teachers. Concretely, we prove that distillation from samplers is guaranteed to produce a student which approximates the Bayes optimal classifier. Finally, we show that some common learning algorithms (e.g., Nearest-Neighbours and Kernel Machines)…
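The claim that a sampler can be a bad classifier but a good teacher can be illustrated with a small simulation. The following is a minimal sketch and not the paper's construction: it assumes a one-dimensional input, a clean label rule `x > 0.5`, a label-flip rate of 0.2, a 1-nearest-neighbour teacher, and a hypothetical low-capacity student that takes a majority vote of teacher labels within each of 10 equal-width bins. All names and constants are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE = 0.2  # assumed probability that a training label is flipped
N_TRAIN, N_UNLABELED, N_TEST, N_BINS = 1000, 4000, 2000, 10

def clean_label(x):
    # Assumed noiseless rule; with NOISE < 0.5 it is also the Bayes optimal classifier.
    return (x > 0.5).astype(int)

def nn_predict(x_train, y_train, x_query):
    # 1-nearest-neighbour: copy the (possibly noisy) label of the closest training point.
    nearest = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def bin_index(x):
    # Map x in [0, 1] to one of N_BINS equal-width bins.
    return np.minimum((x * N_BINS).astype(int), N_BINS - 1)

# Noisy training set: each clean label is flipped independently with probability NOISE.
x_train = rng.uniform(0, 1, N_TRAIN)
flip = rng.uniform(0, 1, N_TRAIN) < NOISE
y_train = np.where(flip, 1 - clean_label(x_train), clean_label(x_train))

# Teacher: 1-NN labels fresh unlabeled points, replicating the training noise.
x_unlabeled = rng.uniform(0, 1, N_UNLABELED)
teacher_labels = nn_predict(x_train, y_train, x_unlabeled)

# Student: majority vote of teacher labels inside each bin.
bins = bin_index(x_unlabeled)
ones = np.bincount(bins, weights=teacher_labels, minlength=N_BINS)
counts = np.bincount(bins, minlength=N_BINS)
student_table = (ones > counts / 2).astype(int)

def student_predict(x):
    return student_table[bin_index(x)]

# Evaluate both models against the clean labels on fresh test points.
x_test = rng.uniform(0, 1, N_TEST)
train_acc = np.mean(nn_predict(x_train, y_train, x_train) == y_train)
teacher_acc = np.mean(nn_predict(x_train, y_train, x_test) == clean_label(x_test))
student_acc = np.mean(student_predict(x_test) == clean_label(x_test))

print(f"teacher accuracy on noisy train labels: {train_acc:.3f}")
print(f"teacher accuracy on clean test labels:  {teacher_acc:.3f}")
print(f"student accuracy on clean test labels:  {student_acc:.3f}")
```

In this toy setting the teacher fits its noisy training labels exactly, yet its clean test accuracy should sit near `1 - NOISE`, because it reproduces the label noise on unseen points. The student averages that noise away within each bin and should land much closer to the clean rule. The bin edges here happen to align with the decision boundary at 0.5, which flatters the student; a different boundary would cost it some accuracy in the bin that straddles it.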
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Neural Networks and Applications
