Self-Distillation as Instance-Specific Label Smoothing
Zhilu Zhang, Mert R. Sabuncu

TL;DR
This paper shows experimentally that the gains from multi-generational self-distillation are associated in part with increasing diversity in teacher predictions. It interprets teacher-student training as amortized MAP estimation, in which teacher predictions act as instance-specific regularization, and proposes an instance-specific label smoothing method based on that view.
Contribution
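It introduces a theoretical framework (teacher-student training as amortized MAP estimation) that relates self-distillation to label smoothing and highlights predictive diversity alongside predictive uncertainty. It also presents a new instance-specific label smoothing technique that often outperforms classical label smoothing.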
Findings
The improved generalization from multi-generational self-distillation is associated in part with increasing diversity in teacher predictions.
The framework theoretically relates self-distillation to label smoothing.
Proposed method often outperforms classical label smoothing.
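To make the distinction between classical and instance-specific smoothing concrete, here is a minimal NumPy sketch. It is not the paper's method: the mixing rule in `instance_specific_targets`, the random stand-in for teacher predictions, and the `true_class_variance` diversity measure are all assumptions chosen for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return one-hot rows for integer class labels."""
    return np.eye(num_classes)[labels]


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classical_targets(labels, num_classes, eps=0.1):
    """Classical label smoothing: mix each one-hot label with the uniform
    distribution, so every sample of a class gets the same soft target."""
    return (1.0 - eps) * one_hot(labels, num_classes) + eps / num_classes


def instance_specific_targets(labels, teacher_probs, eps=0.1):
    """Hypothetical instance-specific variant: mix each one-hot label with
    that sample's teacher prediction instead of the uniform distribution."""
    num_classes = teacher_probs.shape[1]
    return (1.0 - eps) * one_hot(labels, num_classes) + eps * teacher_probs


def true_class_variance(targets, labels):
    """Illustrative diversity measure (an assumption, not from the paper):
    variance across samples of the target mass placed on the true class."""
    true_class_mass = targets[np.arange(len(labels)), labels]
    return float(np.var(true_class_mass))


rng = np.random.default_rng(0)
num_samples, num_classes = 8, 3
labels = rng.integers(0, num_classes, size=num_samples)
# Stand-in for teacher predictions; a real teacher would be a trained model.
teacher_probs = softmax(rng.normal(size=(num_samples, num_classes)))

uniform_smoothed = classical_targets(labels, num_classes)
teacher_smoothed = instance_specific_targets(labels, teacher_probs)

print("classical diversity:", true_class_variance(uniform_smoothed, labels))
print("instance-specific diversity:", true_class_variance(teacher_smoothed, labels))
```

Under these assumptions, classical smoothing gives every sample the same true-class mass, so the variance is zero, while teacher-based targets differ from sample to sample.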
Abstract
It has been recently demonstrated that multi-generational self-distillation can improve generalization. Despite this intriguing observation, reasons for the enhancement remain poorly understood. In this paper, we first demonstrate experimentally that the improved performance of multi-generational self-distillation is in part associated with the increasing diversity in teacher predictions. With this in mind, we offer a new interpretation for teacher-student training as amortized MAP estimation, such that teacher predictions enable instance-specific regularization. Our framework allows us to theoretically relate self-distillation to label smoothing, a commonly used technique that regularizes predictive uncertainty, and suggests the importance of predictive diversity in addition to predictive uncertainty. We present experimental results using multiple datasets and neural network…
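The connection between distillation and label smoothing mentioned in the abstract can be illustrated with the standard textbook forms of the two losses. This is a sketch in assumed notation, not the paper's amortized MAP derivation: $K$ is the number of classes, $y$ the one-hot label, $p_\theta(k \mid x)$ the student prediction, $q_k(x)$ the teacher prediction, and $\epsilon, \alpha \in [0, 1]$ are mixing weights.

```latex
\begin{align}
\mathcal{L}_{\mathrm{LS}}(x, y)
  &= -\sum_{k=1}^{K} \Big[ (1-\epsilon)\, y_k + \tfrac{\epsilon}{K} \Big] \log p_\theta(k \mid x), \\
\mathcal{L}_{\mathrm{KD}}(x, y)
  &= -\sum_{k=1}^{K} \Big[ (1-\alpha)\, y_k + \alpha\, q_k(x) \Big] \log p_\theta(k \mid x).
\end{align}
```

Setting $q_k(x) = 1/K$ and $\alpha = \epsilon$ in the second loss recovers the first, so in this form classical label smoothing is distillation from a teacher that predicts the uniform distribution for every input. A learned teacher instead supplies a different soft target for each $x$.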
Taxonomy
Topics: Machine Learning and Data Classification · Robotic Path Planning Algorithms
Methods: Label Smoothing
