Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network
Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang

TL;DR
This paper provides a theoretical framework explaining how early stopping in overparameterized neural networks helps distillation by harvesting dark knowledge, and introduces a self-distillation method that refines noisy labels and improves generalization.
Contribution
It introduces the concept of Anisotropic Information Retrieval (AIR) to explain distillation, and proposes a self-distillation algorithm with theoretical convergence guarantees and practical benefits.
Findings
Self-distillation improves accuracy without early stopping.
The self-distillation algorithm provably converges to the ground-truth labels in overparameterized networks.
Empirical results show better test accuracy and robustness to noisy labels.
Abstract
Distillation is a method to transfer knowledge from one model to another and often achieves higher accuracy with the same capacity. In this paper, we aim to provide a theoretical understanding of what mainly helps with distillation. Our answer is "early stopping". Assuming that the teacher network is overparameterized, we argue that the teacher network is essentially harvesting dark knowledge from the data via early stopping. This can be justified by a new concept, Anisotropic Information Retrieval (AIR), which means that the neural network tends to fit the informative information first and the non-informative information (including noise) later. Motivated by recent developments in the theoretical analysis of overparameterized neural networks, we can characterize AIR by the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new understanding of distillation. With that,…
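The AIR idea in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's code or experiment: it assumes plain gradient descent on squared loss with a fixed kernel matrix `K` standing in for the NTK, and all names (`K`, `lr`, `steps`, `residual_in_eigenbasis`) are hypothetical. In that simplified setting, the training residual along the eigenvector with eigenvalue λ shrinks by a factor of (1 - lr·λ) per step, so directions with large eigenvalues are fitted first and directions with small eigenvalues are fitted much later.

```python
import numpy as np

def residual_in_eigenbasis(K, y, lr, steps):
    """Gradient descent on the training predictions f under squared loss
    with a fixed kernel K; returns the eigenvalues of K, the final residual
    and the initial residual, both expressed in the eigenbasis of K."""
    eigvals, eigvecs = np.linalg.eigh(K)      # eigenvalues in ascending order
    f = np.zeros_like(y)                      # predictions start at zero
    for _ in range(steps):
        f = f - lr * K @ (f - y)              # kernel gradient-descent update
    return eigvals, eigvecs.T @ (f - y), eigvecs.T @ (-y)

# Toy data: a random positive-definite kernel (an assumption, not a real NTK).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = X @ X.T / 5 + 1e-3 * np.eye(20)
y = rng.normal(size=20)

steps = 50
lr = 1.0 / np.linalg.eigvalsh(K).max()        # step size that keeps GD stable
eigvals, res, res0 = residual_in_eigenbasis(K, y, lr, steps)

# Per-direction shrink factor predicted by the linear dynamics.
decay = (1.0 - lr * eigvals) ** steps
print("shrink factor, smallest eigenvalue:", decay[0])
print("shrink factor, largest eigenvalue: ", decay[-1])
```

Stopping at a moderate number of steps therefore keeps what was learned along the large-eigenvalue directions while leaving the small-eigenvalue directions mostly unfitted, which is the sense in which early stopping acts as a filter in this toy model.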
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification
Methods: Early Stopping
