Self-Distillation Amplifies Regularization in Hilbert Space
Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett

TL;DR
This paper provides a theoretical analysis of self-distillation in deep learning, revealing how iterative self-distillation acts as a form of regularization in Hilbert space, affecting model complexity and performance.
Contribution
It introduces the first rigorous theoretical framework for understanding self-distillation, showing its role in modifying regularization and basis function selection.
Findings
Self-distillation progressively limits the number of basis functions that can be used to represent the solution.
A few rounds of self-distillation can reduce overfitting.
Too many rounds can cause underfitting and degrade performance.
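A simplified way to see why repeated rounds first help and then hurt is sketched below. It assumes kernel ridge regression with a fixed regularization weight \lambda, which is an illustrative assumption and not necessarily the paper's exact formulation. The symbols K, V, d_j, \lambda, and y_t are introduced here for illustration only.

```latex
% Assumed setting: Gram matrix K = V D V^T with eigenvalues d_j >= 0,
% fixed ridge weight \lambda > 0, and training targets y_t after t rounds (y_0 = y).
\begin{align*}
y_t &= K (K + \lambda I)^{-1} y_{t-1}
     = V \, \mathrm{diag}\!\left(\frac{d_j}{d_j + \lambda}\right) V^\top y_{t-1}, \\
y_t &= V \, \mathrm{diag}\!\left(\left(\frac{d_j}{d_j + \lambda}\right)^{t}\right) V^\top y_0 .
\end{align*}
```

Every factor d_j / (d_j + \lambda) is below 1, and the factors for small eigenvalues decay fastest under repeated multiplication. Under this assumption, a few rounds suppress the low-eigenvalue directions (fewer effective basis functions, less overfitting), while many rounds shrink every direction toward zero (underfitting).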
Abstract
Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics do not receive any new information about the task and evolve solely by looping over training. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is Hilbert space…
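The retraining loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes kernel ridge regression with an RBF kernel and a fixed regularization weight, and the names `rbf_kernel`, `self_distill`, `lam`, and `length_scale`, along with the synthetic sine data, are hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    # Gaussian (RBF) kernel matrix between two 1-D input arrays.
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def self_distill(K, y, lam, rounds):
    # Each round fits kernel ridge regression to the current targets,
    # then reuses the fitted training predictions as the next targets.
    n = K.shape[0]
    targets = y.copy()
    history = []
    for _ in range(rounds):
        alpha = np.linalg.solve(K + lam * np.eye(n), targets)
        targets = K @ alpha
        history.append(targets.copy())
    return history

# Synthetic noisy data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

K = rbf_kernel(x, x)
history = self_distill(K, y, lam=0.1, rounds=5)

# Norm of the training predictions after each round.
norms = [float(np.linalg.norm(h)) for h in history]
print(norms)
```

In this fixed-weight sketch the prediction norm shrinks every round, which mirrors the claim that no new information enters the loop and that the effective regularization grows with the number of rounds.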
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
