Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation
Kenneth Borup, Lars N. Andersen

TL;DR
This paper provides a theoretical analysis of self-distillation in kernel regression, demonstrating how incorporating ground-truth targets influences regularization and offering practical methods to optimize distillation parameters.
Contribution
It introduces the first theoretical framework for weighted ground-truth targets in self-distillation and derives closed-form solutions for optimal weighting, reducing computational costs.
Findings
Infinitely many distillation steps amplify the regularization of the original problem.
The optimal weighting parameter has a closed form and can be estimated efficiently.
Weighting in the ground-truth targets dampens the regularization that self-distillation imposes.
Abstract
Knowledge distillation is classically a procedure where a neural network is trained on the output of another network along with the original targets in order to transfer knowledge between the architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of using the weighted ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean square error objective function suitable for distillation, subject to regularization of the model parameters. We show that any such…
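The iterative procedure described in the abstract can be sketched for kernel ridge regression as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names rbf_kernel, self_distill, lam and alpha are hypothetical, the penalty is used without any 1/n scaling, and alpha is taken to be the weight on the ground-truth targets. Because the two weights sum to one, minimizing the weighted squared error against y and against the previous predictions is equivalent to a ridge fit against their convex combination, which is what the loop does.

```python
import numpy as np


def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def self_distill(K, y, lam, alpha, steps):
    """Training-set predictions of each model in a self-distillation chain.

    Assumed formulation (an illustration, not taken from the paper):
    step 1 is a plain kernel ridge fit to y; each later step is a ridge
    fit to alpha * y + (1 - alpha) * previous_predictions.
    """
    n = K.shape[0]
    # Ridge smoother A = (K + lam I)^{-1} K, so that predictions = A @ targets.
    A = np.linalg.solve(K + lam * np.eye(n), K)
    preds = [A @ y]
    for _ in range(steps - 1):
        targets = alpha * y + (1.0 - alpha) * preds[-1]
        preds.append(A @ targets)
    return preds
```

In this sketch, alpha = 1 reproduces the initial fit at every step, while alpha = 0 is pure self-distillation and shrinks the predictions further each round. For 0 < alpha <= 1 the chain converges to the ridge solution with penalty lam / alpha, which is one concrete way to see both the amplified regularization in the limit and the dampening effect of the ground-truth weight.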
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Fire Detection and Safety Systems · Neural Networks and Applications
Methods: Batch Normalization · 1x1 Convolution · Residual Connection · Residual Block · Bottleneck Residual Block · Average Pooling · Max Pooling · Convolution · Global Average Pooling
