How to Train the Teacher Model for Effective Knowledge Distillation
Shayan Mohajer Hamidi, Xizhen Deng, Renhao Tan, Linfeng Ye, Ahmed Hussein Salamah

TL;DR
This paper shows that training the teacher model with mean squared error (MSE) loss instead of cross-entropy loss improves knowledge distillation (KD): the teacher's output moves closer to the true Bayes conditional probability density (BCPD), which in turn raises student accuracy.
Contribution
The paper proposes training the teacher with MSE loss rather than cross-entropy loss for knowledge distillation, and supports the proposal with extensive experiments showing gains in student accuracy.
Findings
Training the teacher with MSE loss improves student accuracy by up to 2.6%.
Replacing the conventional cross-entropy-trained teacher with an MSE-trained teacher makes KD more effective.
MSE training brings the teacher's output closer to the true BCPD.
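The link between MSE training and the BCPD can be sketched with a standard decomposition. The notation here is an assumption for illustration and is not taken from the paper: $y$ is the one-hot label, $p^*(x) = \mathbb{E}[y \mid x]$ is the BCPD, and $q(x)$ is the teacher's output.

```latex
\begin{align*}
\mathbb{E}\big[\|q(x) - y\|^2 \,\big|\, x\big]
  &= \mathbb{E}\big[\|q(x) - p^*(x) + p^*(x) - y\|^2 \,\big|\, x\big] \\
  &= \|q(x) - p^*(x)\|^2
     + 2\,\big(q(x) - p^*(x)\big)^{\top}\,
       \mathbb{E}\big[p^*(x) - y \,\big|\, x\big]
     + \mathbb{E}\big[\|p^*(x) - y\|^2 \,\big|\, x\big] \\
  &= \|q(x) - p^*(x)\|^2
     + \mathbb{E}\big[\|p^*(x) - y\|^2 \,\big|\, x\big].
\end{align*}
```

The cross term vanishes because $\mathbb{E}[y \mid x] = p^*(x)$, and the last term does not depend on $q$. Under these assumptions, minimizing the expected MSE against the labels is the same as minimizing the MSE between the teacher's output and the BCPD.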
Abstract
Recently, it was shown that the role of the teacher in knowledge distillation (KD) is to provide the student with an estimate of the true Bayes conditional probability density (BCPD). Notably, these findings indicate that the student's error rate can be upper-bounded by the mean squared error (MSE) between the teacher's output and the BCPD. Consequently, to enhance KD efficacy, the teacher should be trained such that its output is close to the BCPD in the MSE sense. This paper shows that training the teacher model with MSE loss is equivalent to minimizing the MSE between its output and the BCPD, which matches the teacher's core role of giving the student a BCPD estimate that is accurate in MSE terms. In this respect, through a comprehensive set of experiments, we demonstrate that substituting the conventional teacher trained with cross-entropy loss with one trained using MSE loss in…
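A minimal sketch of the two teacher losses contrasted in the abstract, in plain Python. It assumes the MSE is taken between the teacher's softmax output and the one-hot label; whether the paper applies it to probabilities or to logits is not stated in this summary, and all function names below are hypothetical. The last two helpers check numerically, on a made-up three-class distribution, that the expected MSE against sampled labels differs from the MSE against the true distribution only by a constant.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def mse_teacher_loss(logits, label):
    """Squared error between softmax(logits) and the one-hot label
    (assumed form of the MSE teacher objective)."""
    probs = softmax(logits)
    target = one_hot(label, len(logits))
    return sum((p - t) ** 2 for p, t in zip(probs, target))

def cross_entropy_teacher_loss(logits, label):
    """Conventional cross-entropy objective, for comparison."""
    return -math.log(softmax(logits)[label])

def expected_mse_to_labels(q, p_true):
    """E_y ||q - onehot(y)||^2 when y is drawn from p_true."""
    k = len(q)
    return sum(
        p_true[y] * sum((q[j] - t) ** 2 for j, t in enumerate(one_hot(y, k)))
        for y in range(k)
    )

def mse_to_true_distribution(q, p_true):
    """||q - p_true||^2, the quantity the teacher should make small."""
    return sum((a - b) ** 2 for a, b in zip(q, p_true))

if __name__ == "__main__":
    p_true = [0.7, 0.2, 0.1]   # illustrative "true" class probabilities
    q = [0.5, 0.3, 0.2]        # illustrative teacher output
    offset = 1.0 - sum(p * p for p in p_true)  # does not depend on q
    print(expected_mse_to_labels(q, p_true))
    print(mse_to_true_distribution(q, p_true) + offset)
```

The two printed values agree up to floating-point error, so under these assumptions any teacher output that lowers the label MSE also lowers the distance to the true distribution by the same amount.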
Taxonomy
Topics: Technology-Enhanced Education Studies · Education and Critical Thinking Development · Innovative Teaching and Learning Methods
Methods: Sparse Evolutionary Training · Knowledge Distillation
