The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model
Kaito Takanami, Takashi Takahashi, Ayaka Sakata

TL;DR
This paper investigates how self-distillation improves binary classification on noisy data. It identifies denoising through hard pseudo-labels as the key factor and proposes two heuristics, early stopping and bias parameter fixing, supported by theoretical analysis and experiments.
Contribution
It provides a theoretical analysis of self-distillation's effectiveness in noisy Gaussian mixture models and introduces practical heuristics to enhance its performance.
Findings
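Denoising via hard pseudo-labels is the main driver of the effectiveness of self-distillation (SD), with the most notable gains on moderately sized datasets.
Early stopping, i.e., limiting the number of SD stages, is broadly effective.
Fixing the bias parameter helps under label imbalance.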
Abstract
Self-distillation (SD), a technique where a model improves itself using its own predictions, has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear. In this study, we investigate the efficacy of hyperparameter-tuned multi-stage SD with a linear classifier for binary classification on noisy Gaussian mixture data. For the analysis, we employ the replica method from statistical physics. Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, with the most notable gains observed in moderately sized datasets. We also identify two practical heuristics to enhance SD: early stopping that limits the number of stages, which is broadly effective, and bias parameter fixing, which helps under label imbalance. To empirically…
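The multi-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data generator, the ridge-regularized logistic loss, the plain gradient-descent optimizer, and all names and default values (`make_noisy_gmm`, `fit_ridge_logistic`, `self_distill`, `lam`, `snr`, `flip_prob`) are assumptions made for illustration, and the regularization strength is held fixed rather than tuned per stage as in the paper.

```python
import numpy as np


def make_noisy_gmm(n, d, snr, flip_prob, rng):
    """Two-cluster Gaussian mixture; each label is flipped with prob. flip_prob."""
    mean = np.ones(d) / np.sqrt(d)            # unit-norm cluster direction (assumed)
    y_true = rng.choice([-1.0, 1.0], size=n)
    x = snr * y_true[:, None] * mean + rng.standard_normal((n, d))
    flips = rng.random(n) < flip_prob
    y_noisy = np.where(flips, -y_true, y_true)
    return x, y_true, y_noisy


def fit_ridge_logistic(x, y, lam, steps=500, lr=0.5):
    """Gradient descent on the L2-regularized logistic loss; returns (w, b)."""
    n, d = x.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margin = np.clip(y * (x @ w + b), -30.0, 30.0)  # avoid overflow in exp
        g = -y / (1.0 + np.exp(margin))                  # d(loss)/d(score)
        w -= lr * (x.T @ g / n + lam * w)
        b -= lr * g.mean()
    return w, b


def predict(model, x):
    """Hard labels in {-1, +1} from a linear model (w, b)."""
    w, b = model
    return np.where(x @ w + b >= 0, 1.0, -1.0)


def accuracy(model, x, y):
    return float(np.mean(predict(model, x) == y))


def self_distill(x, y_noisy, n_stages, lam):
    """Stage 0 fits the noisy labels; each later stage fits the previous
    model's hard pseudo-labels. Returns the n_stages + 1 fitted models."""
    models = [fit_ridge_logistic(x, y_noisy, lam)]
    for _ in range(n_stages):
        pseudo = predict(models[-1], x)       # hard pseudo-labels (denoising step)
        models.append(fit_ridge_logistic(x, pseudo, lam))
    return models
```

In this sketch, the early-stopping heuristic amounts to choosing a small `n_stages` and keeping the stage whose model performs best on held-out data. Whether later stages help depends on the noise level, the dataset size, and the regularization, so the sketch makes no claim about the size of any improvement.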
Taxonomy
Topics: Spectroscopy and Chemometric Analyses · Bayesian Methods and Mixture Models · Religion and Sociopolitical Dynamics in Nigeria
Methods: Softmax · Attention Is All You Need · Average Pooling · Convolution · Global Average Pooling · Kaiming Initialization · Max Pooling · Early Stopping
