On student-teacher deviations in distillation: does it pay to disobey?

Vaishnavh Nagarajan; Aditya Krishna Menon; Srinadh Bhojanapalli; Hossein Mobahi; Sanjiv Kumar

arXiv:2301.12923·cs.LG·March 20, 2024·6 cites

TL;DR

This paper investigates why student networks in knowledge distillation sometimes outperform teachers despite deviations from teacher probabilities, revealing that exaggeration of confidence and bias can enhance generalization.

Contribution

It characterizes student-teacher deviations in KD, linking confidence exaggeration and implicit bias to improved student performance and generalization.

Findings

01

The student exaggerates the teacher's confidence levels.

02

KD accelerates convergence along top data eigendirections.

03

Exaggerated bias and confidence contribute to better generalization.

Abstract

Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network by training it to mimic the soft probabilities of a trained "teacher" network. Yet recent work has shown that, despite being trained to fit the teacher's probabilities, the student may not only deviate significantly from those probabilities but may also outperform the teacher. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in…
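The two quantities the abstract refers to, fitting the teacher's soft probabilities and exaggerating the teacher's confidence, can be illustrated with a minimal sketch. This is not the paper's code: the function names, the `temperature` parameter, and the example logits are all assumptions made for illustration. The loss shown is the commonly used cross-entropy between teacher and student probabilities, and "confidence gap" here is simply the student's probability minus the teacher's on the teacher's top class.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; `temperature` is an illustrative knob."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's probabilities against the teacher's
    soft probabilities (a common KD objective, assumed here)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def confidence_gap(student_logits, teacher_logits):
    """Student probability minus teacher probability on the teacher's top
    class. A positive value means the student is more confident than the
    teacher on that class."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    top = max(range(len(p_teacher)), key=p_teacher.__getitem__)
    return p_student[top] - p_teacher[top]


# Hypothetical logits for a single three-class example.
teacher = [2.0, 1.0, 0.0]
student = [4.0, 1.0, 0.0]

print("KD loss:", distillation_loss(student, teacher))
print("confidence gap:", confidence_gap(student, teacher))
```

With these made-up logits the gap is positive, meaning the student assigns more probability than the teacher to the teacher's top class. The loss is also larger than it would be if the student matched the teacher exactly, which is the sense in which the student "deviates" from its training target.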

Videos

On student-teacher deviations in distillation: does it pay to disobey?· slideslive
