SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
Itai Morad, Nir Shlezinger, Yonina C. Eldar

TL;DR
This paper provides a Bayesian theoretical analysis of knowledge distillation with SGD, showing how Bayesian teachers improve student accuracy and stability, and offers practical guidelines for effective distillation.
Contribution
It introduces a Bayesian perspective to analyze KD convergence, demonstrating benefits of Bayesian teachers over deterministic ones and guiding improved distillation practices.
Findings
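Supervision with Bayes class probabilities reduces variance and removes the neighborhood terms in the SGD convergence bounds, compared to one-hot labels.
Students distilled from Bayesian teachers gain up to 4.27% in accuracy.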
Distillation from Bayesian teachers results in more stable training with less noise.
Abstract
Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: when the teacher provides the exact Bayes Class Probabilities (BCPs); and supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated…
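The variance-reduction claim in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's experiment: the three-class probabilities, the sample count, and the helper names `softmax` and `logit_grad` are assumptions chosen for illustration. The sketch takes a student whose output already equals the BCPs for one input and compares the per-sample cross-entropy gradient with respect to the logits under one-hot labels and under BCP targets.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_grad(logits, targets):
    # Gradient of cross-entropy w.r.t. the logits: softmax(logits) - targets.
    return softmax(logits) - targets

rng = np.random.default_rng(0)
bcp = np.array([0.7, 0.2, 0.1])   # hypothetical Bayes class probabilities for one input
logits = np.log(bcp)              # a student that already outputs the BCPs

n = 10_000
labels = rng.choice(3, size=n, p=bcp)   # labels sampled from the true posterior
one_hot = np.eye(3)[labels]

grad_one_hot = logit_grad(logits, one_hot)            # one gradient per sampled label
grad_bcp = logit_grad(logits, np.tile(bcp, (n, 1)))   # the same target every time

# Total gradient variance (summed over logit coordinates).
var_one_hot = grad_one_hot.var(axis=0).sum()
var_bcp = grad_bcp.var(axis=0).sum()

print("one-hot gradient variance:", var_one_hot)
print("BCP gradient variance:    ", var_bcp)
```

With one-hot labels the gradient is zero only on average, and its variance is close to the sum of p_k(1 - p_k) over classes (0.46 for these assumed probabilities). With BCP targets every per-sample gradient vanishes at the optimum, which is the interpolation property behind the missing neighborhood term.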
Peer Reviews
Decision: ICLR 2026 Poster
1. Showing that the CE risk with BCP supervision shares the same minimizer as standard supervision (the Bayes posterior; the minimum equals $H(Y|X)$), then establishing interpolation for the BCP-supervised objective (Props. 1–2), is crisp and well-grounded. 2. Thms. 1–2 remove the variance neighborhood term found in standard SGD and allow a wider stepsize range, formalizing a compelling optimization advantage of distillation from *accurate* probabilities. 3. The Dirichlet perturbation app
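The shared-minimizer statement in point 1 of the review above follows from the standard decomposition of cross-entropy into entropy plus KL divergence. The notation below is an assumption and not taken from the paper: $q_k(x)$ denotes the student's predicted probability of class $y_k$.

```latex
\mathbb{E}_{x}\Big[-\sum_{k} P(y_k \mid x)\,\log q_k(x)\Big]
  = H(Y \mid X)
  + \mathbb{E}_{x}\Big[\mathrm{KL}\big(P(\cdot \mid x)\,\big\|\,q(\cdot \mid x)\big)\Big]
  \;\ge\; H(Y \mid X)
```

Equality holds exactly when $q(\cdot \mid x) = P(\cdot \mid x)$ almost everywhere. The expected one-hot risk $\mathbb{E}_{x,y}[-\log q_y(x)]$ equals the same left-hand side, so both objectives share the Bayes posterior as minimizer and $H(Y \mid X)$ as minimum.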
1. Prop. 3 weights Jacobian norms by $1/P(y_k|x)$ (or $1/P(y_k|x)^2$ with noisy BCPs). If any class probability can be arbitrarily small, the gradient-noise bounds can blow up. You should make explicit an assumption like $P(y_k|x) \ge \epsilon > 0$ (or work with smoothed targets) and reflect this in all statements depending on Eqs. (13)–(14). 2. Additive perturbations can leave the simplex. While Appendix D covers a Dirichlet alternative, the main text should either use the Dirichlet model (prefe
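Both concerns in the review above can be made concrete with a small NumPy sketch. It is illustrative only: the probabilities, the noise scale, the Dirichlet `concentration`, and the `smooth` helper are assumptions, and the Dirichlet parameterization here is not necessarily the one used in the paper's Appendix D.

```python
import numpy as np

rng = np.random.default_rng(0)
bcp = np.array([0.97, 0.02, 0.01])   # hypothetical BCPs with small-probability classes

# Additive Gaussian noise: perturbed vectors can have entries outside [0, 1].
additive = bcp + rng.normal(scale=0.05, size=(1000, 3))
outside = (additive < 0).any(axis=1) | (additive > 1).any(axis=1)
frac_invalid = outside.mean()

# Dirichlet noise centred on the BCPs: the mean is bcp and every sample
# is non-negative and sums to one.
concentration = 50.0                  # hypothetical; larger means less noise
dirichlet = rng.dirichlet(concentration * bcp, size=1000)

def smooth(p, eps):
    # Mix with the uniform distribution; every entry becomes at least eps / K.
    return (1.0 - eps) * p + eps / p.shape[-1]

smoothed = smooth(bcp, eps=0.03)

print("fraction of additive samples off the simplex:", frac_invalid)
print("largest 1/P weight before smoothing:", (1.0 / bcp).max())
print("largest 1/P weight after smoothing: ", (1.0 / smoothed).max())
```

Smoothing bounds the 1/P weights by K/eps, which is one way to realize the lower-bound assumption the reviewer asks for. Dirichlet sampling keeps the noisy targets valid without any projection step.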
1. Originality: The paper provides a mathematical proof, from a Bayesian theoretical perspective, explaining why using soft probabilistic outputs in KD leads to better performance under the setting of an SGD optimizer. As mentioned in the Related Work section, the authors generalize this theoretical result beyond special cases such as self-distillation or model compression to more general classification settings, which represents a clear contribution compared to prior research. 2. Quality: The
1. In Figure 1, it would be beneficial to further quantify the amount of noise and present this quantitatively. Moreover, based on the plots of generalization error and test accuracy per epoch, it seems that the results were obtained using a single random seed. If the authors were to test with multiple seeds and compute the variance of generalization error and test accuracy per epoch for the four cases, it could more clearly demonstrate that the true Bayes probabilities exhibit significantly low
1. This paper is well organized, highly detailed, and balanced between theoretical depth and readability. 2. The theoretical analysis is supported by empirical evidence. 3. There is potential practicality, as the authors also show the benefit of converting pre-trained models into BNNs to improve the effectiveness of knowledge distillation.
1. (Minor) The analysis is based on SGD, but the experiments are conducted with Adam. Although this can show that the analysis also applies to other SGD-related optimizers, it would be better if there were some analyses or at least citations to show such generalizability from a theoretical perspective. 2. (Minor) The experiments are based solely on image classification. Could there be more complex tasks, such as semantic segmentation or object detection? 3. (Minor) Some related work is recommended
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Advanced Graph Neural Networks
