On the Generalization of Knowledge Distillation: An Information-Theoretic View
Bingying Li, Haiyun He

TL;DR
This paper provides an information-theoretic framework for understanding the generalization capabilities of knowledge distillation, deriving bounds and insights into the roles of divergence, stability, and teacher flatness.
Contribution
The paper models teacher and student training as coupled stochastic processes, introduces a distillation divergence (the Kullback-Leibler divergence between the two stochastic kernels), and derives generalization bounds and practical guidance for distillation design.
Findings
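Derives two bounds on the student's generalization relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence.
Shows that the teacher's local flatness can strictly tighten a loss-sharpness-aware bound.
Decomposes the distillation divergence into bias, variance, and rank costs in a Gaussian case.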
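To give a feel for why a KL divergence splits into separate costs in a Gaussian case, here is a minimal sketch using the textbook closed form for two univariate Gaussians. It is not the paper's decomposition: the function name `gaussian_kl` and the two-term split are assumptions made for illustration, and a one-dimensional example cannot show a rank cost at all.

```python
import math


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """Return KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)) as two terms.

    Illustrative only (hypothetical helper, not from the paper):
      mean_term   penalizes the gap between the two means (a bias-like cost)
      spread_term penalizes mismatched variances (a variance-like cost)
    The full KL divergence is mean_term + spread_term.
    """
    mean_term = (mean_p - mean_q) ** 2 / (2.0 * std_q ** 2)
    ratio = (std_p / std_q) ** 2
    spread_term = 0.5 * (ratio - 1.0 - math.log(ratio))
    return mean_term, spread_term


if __name__ == "__main__":
    bias_like, variance_like = gaussian_kl(1.0, 1.0, 0.0, 2.0)
    print("mean term:", bias_like)
    print("spread term:", variance_like)
    print("total KL:", bias_like + variance_like)
```

Both terms are non-negative and vanish only when the means, respectively the variances, coincide, which is the sense in which each can be read as a separate cost.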
Abstract
Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound.…
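The standard setting described in the abstract, where a teacher's soft predictions guide the student, can be sketched as a KL divergence between softened class distributions. This is a minimal illustration under stated assumptions: the temperature parameter and the function names are not taken from the paper, and this per-example prediction loss is distinct from the paper's distillation divergence, which is defined between the stochastic kernels of the two training processes.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between teacher and student soft predictions.

    The temperature value is an assumption for illustration only.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print("distillation loss:", distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's soft predictions exactly and grows as the two distributions move apart.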
