Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

Hien Dang; Pratik Patil; Alessandro Rinaldo

arXiv:2602.17565·math.ST·February 20, 2026

TL;DR

This paper provides a comprehensive theoretical analysis of self-distillation in ridge regression, deriving optimal mixing weights, risk improvements, and a practical one-shot tuning method, supported by experiments.

Contribution

It introduces the first formal guarantees for unconstrained self-distillation in ridge regression, including explicit formulas for optimal mixing and risk improvement, along with a practical tuning approach.

Findings

01

The optimal mixing weight can be negative, especially in over-regularized regimes.

02

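Self-distillation with the optimal mixing weight strictly improves on the ridge teacher's risk at every regularization level $λ > 0$ where the teacher's risk is nonstationary ($R'(λ) \neq 0$).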

03

The proposed one-shot tuning method accurately estimates the optimal mixing weight.

Abstract

Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in the unconstrained setting, in which the mixing weight $ξ$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $λ > 0$ at which the teacher ridge risk $R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $ξ^{⋆}(λ)$ for any value of $λ$ and show…
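The mixing step described in the abstract can be illustrated with a minimal NumPy sketch. Several choices here are assumptions rather than the paper's definitions: the weight ξ is placed on the teacher's predictions and 1 − ξ on the labels, the ridge objective is scaled by 1/n, the student reuses the teacher's λ, and the function names (`ridge_fit`, `sd_student`, `best_xi_on_holdout`) and the synthetic data are hypothetical. The tuning routine is a plain held-out least-squares fit, not the paper's one-shot method or its closed-form $ξ^{⋆}(λ)$. It relies only on the fact that, under this parameterization, the student's coefficients are affine in ξ, so any squared prediction error is quadratic in ξ.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients for the (1/n)-scaled objective (scaling is an assumption)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def sd_student(X, y, lam, xi):
    """Refit ridge on targets (1 - xi) * y + xi * teacher_predictions.

    xi is unconstrained: it may be negative or exceed 1.
    """
    teacher = ridge_fit(X, y, lam)
    mixed_targets = (1.0 - xi) * y + xi * (X @ teacher)
    return ridge_fit(X, mixed_targets, lam)

def best_xi_on_holdout(X, y, X_val, y_val, lam):
    """Held-out minimizer of squared error over xi (not the paper's one-shot tuner).

    The student equals teacher + xi * (pure - teacher), where `pure` is the
    student trained on teacher predictions only, so the minimizer is a
    one-dimensional least-squares solution.
    """
    teacher = ridge_fit(X, y, lam)
    pure = ridge_fit(X, X @ teacher, lam)
    residual = y_val - X_val @ teacher
    direction = X_val @ (pure - teacher)
    return float(residual @ direction / (direction @ direction))

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n, p, lam = 200, 20, 5.0
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
X_val = rng.normal(size=(n, p))
y_val = X_val @ beta + rng.normal(size=n)

xi_star = best_xi_on_holdout(X, y, X_val, y_val, lam)
teacher_mse = np.mean((y_val - X_val @ ridge_fit(X, y, lam)) ** 2)
student_mse = np.mean((y_val - X_val @ sd_student(X, y, lam, xi_star)) ** 2)
print(f"xi*: {xi_star:.3f}, teacher MSE: {teacher_mse:.3f}, student MSE: {student_mse:.3f}")
```

Because ξ = 0 recovers the teacher, the held-out error at the fitted ξ cannot exceed the teacher's on the same held-out set. That is an in-sample statement about the validation data, not the paper's risk guarantee.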

Taxonomy

Topics: Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning