How to Train the Teacher Model for Effective Knowledge Distillation

Shayan Mohajer Hamidi; Xizhen Deng; Renhao Tan; Linfeng Ye; Ahmed Hussein Salamah

arXiv:2407.18041 · cs.LG · July 26, 2024

TL;DR

This paper shows that training the teacher model with MSE loss instead of cross-entropy improves knowledge distillation by making the teacher's output closer to the true Bayes conditional probability density, thereby enhancing student accuracy.

Contribution

The paper proposes training the teacher with MSE loss rather than cross-entropy loss to improve knowledge distillation, and supports the proposal with experiments showing gains in student accuracy.

Findings

01. Training the teacher with MSE loss improves student accuracy by up to 2.6%.

02. Replacing the conventional cross-entropy-trained teacher with an MSE-trained teacher enhances KD effectiveness.

03. MSE training aligns the teacher's output more closely with the true BCPD.
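
The third finding rests on a standard decomposition of squared error, sketched below. The notation is an assumption introduced here for illustration and is not taken from the paper: $y$ is the one-hot label, $f(x)$ is the teacher's output probability vector, and $p^*(x) = \mathbb{E}[y \mid x]$ denotes the BCPD.

```latex
\begin{align}
\mathbb{E}\,\lVert f(x) - y \rVert^2
  &= \mathbb{E}\,\lVert f(x) - p^*(x) \rVert^2
   + \mathbb{E}\,\lVert p^*(x) - y \rVert^2
   + 2\,\mathbb{E}\!\left[ (f(x) - p^*(x))^\top (p^*(x) - y) \right] \\
  &= \mathbb{E}\,\lVert f(x) - p^*(x) \rVert^2
   + \mathbb{E}\,\lVert p^*(x) - y \rVert^2 .
\end{align}
```

The cross term vanishes because, conditioned on $x$, the factor $f(x) - p^*(x)$ is fixed and $\mathbb{E}[\,p^*(x) - y \mid x\,] = 0$. The last term does not depend on $f$, so minimizing the MSE training loss over $f$ is the same as minimizing the MSE between the teacher's output and the BCPD.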

Abstract

Recently, it was shown that the role of the teacher in knowledge distillation (KD) is to provide the student with an estimate of the true Bayes conditional probability density (BCPD). Notably, the new findings propose that the student's error rate can be upper-bounded by the mean squared error (MSE) between the teacher's output and BCPD. Consequently, to enhance KD efficacy, the teacher should be trained such that its output is close to BCPD in MSE sense. This paper elucidates that training the teacher model with MSE loss equates to minimizing the MSE between its output and BCPD, aligning with its core responsibility of providing the student with a BCPD estimate closely resembling it in MSE terms. In this respect, through a comprehensive set of experiments, we demonstrate that substituting the conventional teacher trained with cross-entropy loss with one trained using MSE loss in…
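
The substitution described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes that the MSE is taken between the teacher's softmax output and the one-hot label, and it pairs the teacher with the conventional temperature-scaled KD objective for the student; the function names, `alpha`, `temperature`, and the example logits are all hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_ce_loss(logits, label):
    """Conventional teacher objective: cross-entropy against the label."""
    return -math.log(softmax(logits)[label])


def teacher_mse_loss(logits, label):
    """Alternative teacher objective (assumed form): squared error between
    the softmax output and the one-hot label, summed over classes."""
    probs = softmax(logits)
    return sum((p - (1.0 if k == label else 0.0)) ** 2
               for k, p in enumerate(probs))


def kd_loss(student_logits, teacher_logits, label, alpha=0.9, temperature=4.0):
    """Conventional KD objective for the student: a weighted sum of the
    hard-label cross-entropy and the temperature-scaled KL divergence
    from the teacher's soft output to the student's soft output."""
    hard = -math.log(softmax(student_logits)[label])
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    soft = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0.0)
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft


# Hypothetical logits for one 3-class example whose label is class 0.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 0.2, -0.3]

print("teacher CE loss :", teacher_ce_loss(teacher_logits, 0))
print("teacher MSE loss:", teacher_mse_loss(teacher_logits, 0))
print("student KD loss :", kd_loss(student_logits, teacher_logits, 0))
```

In this sketch only the teacher's training objective changes; the student's KD loss is the same whichever teacher supplies `teacher_logits`, which matches the "replace the teacher, keep the KD method" framing of the abstract.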

Code & Models

Repositories

eccv2024mse/eccv_mse_teacher (PyTorch, official)

Taxonomy

Topics: Technology-Enhanced Education Studies · Education and Critical Thinking Development · Innovative Teaching and Learning Methods

Methods: Sparse Evolutionary Training · Knowledge Distillation