Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning
Hien Dang, Pratik Patil, Alessandro Rinaldo

TL;DR
This paper provides a comprehensive theoretical analysis of self-distillation in ridge regression, deriving optimal mixing weights, risk improvements, and a practical one-shot tuning method, supported by experiments.
Contribution
It introduces the first formal guarantees for unconstrained self-distillation in ridge regression, including explicit formulas for optimal mixing and risk improvement, along with a practical tuning approach.
Findings
The optimal mixing weight can be negative, especially in over-regularized regimes.
With the optimal mixing weight, self-distillation strictly improves the ridge teacher's risk at every regularization level where the teacher's risk is nonstationary.
The proposed one-shot tuning method accurately estimates the optimal mixing weight.
Abstract
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in the unconstrained setting, in which the mixing weight may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level at which the teacher ridge risk is nonstationary (i.e., its derivative with respect to the regularization level is nonzero). We obtain a closed-form expression for the optimal mixing weight for any value of the regularization level and show…
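The setup described in the abstract can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not the paper's method: it assumes the student is trained on targets (1 - xi) * y + xi * (teacher predictions) with the same ridge penalty as the teacher (this sign convention and the shared penalty are assumptions), and it picks the mixing weight by minimizing squared error on a held-out set rather than by the paper's closed-form expression or its one-shot tuning rule. The helper names ridge_fit, sd_student, and best_xi_on_holdout are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def sd_student(X, y, lam, xi):
    """Student ridge fit on the mixed targets (1 - xi) * y + xi * X @ beta_teacher.

    Assumed convention: xi = 0 recovers the teacher, xi = 1 is pure distillation.
    xi is unconstrained and may be negative or larger than 1.
    """
    beta_teacher = ridge_fit(X, y, lam)
    y_mixed = (1.0 - xi) * y + xi * (X @ beta_teacher)
    return ridge_fit(X, y_mixed, lam)

def best_xi_on_holdout(X, y, lam, X_val, y_val):
    """Mixing weight minimizing held-out squared error.

    The student is affine in xi: beta(xi) = beta_teacher + xi * (beta_pure - beta_teacher),
    so the held-out error is a quadratic in xi with a closed-form minimizer.
    """
    beta_teacher = ridge_fit(X, y, lam)
    beta_pure = sd_student(X, y, lam, 1.0)
    residual = y_val - X_val @ beta_teacher
    direction = X_val @ (beta_pure - beta_teacher)
    return float(residual @ direction) / float(direction @ direction)

def mse(X_eval, y_eval, beta):
    """Mean squared prediction error of coefficients beta on (X_eval, y_eval)."""
    return float(np.mean((y_eval - X_eval @ beta) ** 2))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, p, lam = 60, 20, 5.0
beta_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)
X_val = rng.normal(size=(500, p))
y_val = X_val @ beta_true + rng.normal(size=500)

xi_star = best_xi_on_holdout(X, y, lam, X_val, y_val)
mse_teacher = mse(X_val, y_val, ridge_fit(X, y, lam))
mse_student = mse(X_val, y_val, sd_student(X, y, lam, xi_star))
print(f"xi*: {xi_star:.3f}  teacher MSE: {mse_teacher:.4f}  student MSE: {mse_student:.4f}")
```

Because xi = 0 reproduces the teacher, the held-out error at the fitted xi can never exceed the teacher's held-out error in this sketch; whether the fitted weight is negative depends on the chosen penalty and data.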
Taxonomy
Topics: Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning
