Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network

Bin Dong; Jikai Hou; Yiping Lu; Zhihua Zhang

arXiv:1910.01255·stat.ML·October 4, 2019·27 cites

TL;DR

This paper provides a theoretical framework explaining how early stopping in overparameterized neural networks helps distillation by harvesting dark knowledge, and introduces a self-distillation method that refines noisy labels and improves generalization.

Contribution

It introduces the concept of Anisotropic Information Retrieval (AIR) to explain distillation, and proposes a self-distillation algorithm with theoretical convergence guarantees and practical benefits.

Findings

01 Self-distillation improves accuracy without early stopping.

02 The self-distillation algorithm provably converges to the ground-truth labels in overparameterized networks.

03 Experiments show better test accuracy and robustness to noisy labels.

Abstract

Distillation is a method to transfer knowledge from one model to another and often achieves higher accuracy with the same capacity. In this paper, we aim to provide a theoretical understanding of what mainly helps distillation. Our answer is "early stopping". Assuming that the teacher network is overparameterized, we argue that the teacher network is essentially harvesting dark knowledge from the data via early stopping. This can be justified by a new concept, Anisotropic Information Retrieval (AIR), which means that the neural network tends to fit the informative information first and the non-informative information (including noise) later. Motivated by recent developments in the theoretical analysis of overparameterized neural networks, we can characterize AIR by the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new understanding of distillation. With that,…
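The anisotropy described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes the standard linearized picture in which gradient descent on a squared loss updates the training predictions as f <- f - lr * K (f - y) for a fixed kernel matrix K. The 2x2 matrix `K`, the targets `y`, the learning rate, and the helper names are all hypothetical choices made for illustration. Under that assumption, the residual component along an eigenvector with eigenvalue lambda shrinks by a factor (1 - lr * lambda) per step, so directions with large eigenvalues are fitted first.

```python
import math


def matvec(K, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(k_ij * v_j for k_ij, v_j in zip(row, v)) for row in K]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def kernel_gd_residuals(K, y, eigvecs, lr, steps):
    """Run f <- f - lr * K (f - y) from f = 0.

    Returns a list with one entry per step (including step 0); each entry
    holds the absolute projection of the residual f - y onto each
    eigenvector in `eigvecs`.
    """
    f = [0.0] * len(y)
    history = []
    for _ in range(steps + 1):
        residual = [f_i - y_i for f_i, y_i in zip(f, y)]
        history.append([abs(dot(residual, v)) for v in eigvecs])
        grad = matvec(K, residual)
        f = [f_i - lr * g_i for f_i, g_i in zip(f, grad)]
    return history


# Hypothetical toy kernel with eigenvalues 3 and 1.
K = [[2.0, 1.0], [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
V = [[s, s], [s, -s]]  # eigenvectors for eigenvalues 3 and 1, in that order
y = [1.0, 0.0]         # has equal weight on both eigen-directions

history = kernel_gd_residuals(K, y, V, lr=0.1, steps=20)
for step in (0, 5, 20):
    top, bottom = history[step]
    print(f"step {step:2d}: top-eigendirection residual {top:.4f}, "
          f"bottom-eigendirection residual {bottom:.4f}")
```

Both residual components start equal, and the one along the larger-eigenvalue direction decays faster. Stopping early therefore keeps mostly the content aligned with the leading eigen-directions, which is the intuition the abstract attaches to AIR.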

Code & Models

Repositories

lizhemin15/self-distillation
pytorch

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification

Methods: Early Stopping