The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model

Kaito Takanami; Takashi Takahashi; Ayaka Sakata

arXiv:2501.16226 · stat.ML · November 20, 2025

TL;DR

This paper investigates how self-distillation improves binary classification on noisy data. It identifies denoising as the key factor and proposes two heuristics, early stopping and bias fixing, supported by theoretical analysis and experiments.

Contribution

The paper provides a theoretical analysis of the effectiveness of self-distillation (SD) in noisy Gaussian mixture models and introduces practical heuristics to enhance its performance.

Findings

1. Denoising via hard pseudo-labels is the main driver of SD's effectiveness.
2. Early stopping, which limits the number of SD stages, is broadly effective.
3. Fixing the bias parameter helps under label imbalance.
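
The early-stopping finding can be illustrated with a small sketch. It assumes that each SD stage has already been scored on held-out data and that the stage count is chosen by stopping once the score fails to improve. The function name `select_num_stages`, the `patience` parameter, and the held-out accuracy criterion are hypothetical choices for illustration, not the selection rule used in the paper.

```python
def select_num_stages(val_acc, patience=1):
    """Return the index of the SD stage to stop at.

    val_acc[t] is the held-out accuracy after stage t (stage 0 is the
    model trained on the original noisy labels). The scan stops once
    `patience` consecutive stages fail to beat the best score so far.
    """
    best_stage, best_acc, waited = 0, val_acc[0], 0
    for stage in range(1, len(val_acc)):
        if val_acc[stage] > best_acc:
            # A strictly better stage resets the patience counter.
            best_stage, best_acc, waited = stage, val_acc[stage], 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_stage
```

A larger `patience` tolerates temporary dips between stages at the cost of running more stages before stopping.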

Abstract

Self-distillation (SD), a technique where a model improves itself using its own predictions, has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear. In this study, we investigate the efficacy of hyperparameter-tuned multi-stage SD with a linear classifier for binary classification on noisy Gaussian mixture data. For the analysis, we employ the replica method from statistical physics. Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, with the most notable gains observed in moderately sized datasets. We also identify two practical heuristics to enhance SD: early stopping that limits the number of stages, which is broadly effective, and bias parameter fixing, which helps under label imbalance. To empirically…
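
The setting described in the abstract can be sketched in a few lines of NumPy: a two-cluster Gaussian mixture with randomly flipped labels, a linear classifier, and several SD stages in which each stage is trained on the hard pseudo-labels of the previous one. This is a minimal sketch under stated assumptions, not the authors' implementation. The closed-form ridge fit stands in for whatever loss the paper uses, regularization is held fixed rather than tuned per stage, there is no bias term, and all names and parameters (`make_noisy_gmm`, `snr`, `flip_prob`, `lam`) are hypothetical.

```python
import numpy as np

def make_noisy_gmm(n, d, snr, flip_prob, rng):
    """Two-cluster Gaussian mixture with symmetric label noise (illustrative)."""
    mean = np.ones(d) / np.sqrt(d)                 # unit-norm cluster direction
    y_true = rng.choice([-1.0, 1.0], size=n)       # balanced true labels
    x = snr * y_true[:, None] * mean + rng.standard_normal((n, d))
    flips = rng.random(n) < flip_prob              # which labels get corrupted
    y_noisy = np.where(flips, -y_true, y_true)
    return x, y_true, y_noisy

def fit_ridge(x, y, lam):
    """Closed-form ridge fit: w = (X^T X + lam * I)^{-1} X^T y."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

def hard_labels(w, x):
    """Hard pseudo-labels in {-1, +1} from a linear classifier."""
    return np.where(x @ w >= 0, 1.0, -1.0)

def self_distill(x, y_noisy, n_stages, lam):
    """Stage 0 fits the noisy labels; each later stage refits on the
    hard pseudo-labels produced by the previous stage."""
    weights = []
    targets = y_noisy
    for _ in range(n_stages + 1):
        w = fit_ridge(x, targets, lam)
        weights.append(w)
        targets = hard_labels(w, x)
    return weights

def accuracy(w, x, y):
    """Fraction of points whose hard prediction matches y."""
    return float(np.mean(hard_labels(w, x) == y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_tr, _, y_noisy = make_noisy_gmm(200, 20, snr=2.0, flip_prob=0.2, rng=rng)
    x_te, y_te, _ = make_noisy_gmm(2000, 20, snr=2.0, flip_prob=0.0, rng=rng)
    for stage, w in enumerate(self_distill(x_tr, y_noisy, n_stages=3, lam=1.0)):
        print(f"stage {stage}: clean test accuracy = {accuracy(w, x_te, y_te):.3f}")
```

Evaluating each stage against clean test labels, as in the demo, is one way to see whether relabeling with hard pseudo-labels removes some of the injected label noise. Whether accuracy rises at a given stage depends on the dataset size, noise rate, and regularization chosen.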

Code & Models

Repositories

taka255/self-distillation-analysis · PyTorch · Official

Videos

The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model · SlidesLive

Taxonomy

Topics: Spectroscopy and Chemometric Analyses · Bayesian Methods and Mixture Models · Religion and Sociopolitical Dynamics in Nigeria

Methods: Softmax · Attention Is All You Need · Average Pooling · Convolution · Global Average Pooling · Kaiming Initialization · Max Pooling · Early Stopping