On student-teacher deviations in distillation: does it pay to disobey?
Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

TL;DR
This paper investigates why student networks in knowledge distillation sometimes outperform teachers despite deviations from teacher probabilities, revealing that exaggeration of confidence and bias can enhance generalization.
Contribution
It characterizes student-teacher deviations in KD, linking confidence exaggeration and implicit bias to improved student performance and generalization.
Findings
The student systematically exaggerates the teacher's confidence levels.
KD accelerates convergence along top data eigendirections.
Exaggerated bias and confidence contribute to better generalization.
Abstract
Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in…
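To make the setup in the abstract concrete, here is a minimal NumPy sketch of a distillation objective (cross-entropy between the teacher's soft probabilities and the student's) together with one possible way to measure confidence exaggeration. The function names, the temperature parameter, and the `confidence_gap` metric are illustrative assumptions, not the paper's own code or its exact measurement.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean cross-entropy of the student against the teacher's soft probabilities.

    Assumption: a plain soft-label cross-entropy with a shared temperature;
    real KD setups often mix this with a hard-label loss.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def confidence_gap(student_logits, teacher_logits):
    """Hypothetical exaggeration metric: mean (student - teacher) probability
    on the teacher's top-1 class. Positive values mean the student is more
    confident than the teacher on the class the teacher prefers."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    top = p_teacher.argmax(axis=-1)
    rows = np.arange(len(top))
    return float((p_student[rows, top] - p_teacher[rows, top]).mean())

# Toy example: a student whose logits are a scaled-up copy of the teacher's.
teacher = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
student = 2.0 * teacher
print("KD loss:", distillation_loss(student, teacher))
print("confidence gap:", confidence_gap(student, teacher))
```

In this toy case the student has a higher distillation loss than a perfect copy of the teacher would, yet its gap is positive: it deviates from the teacher specifically by being more confident on the teacher's preferred class, which is the kind of deviation the abstract describes.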
