Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism

Chenqi Guo; Shiwei Zhong; Xiaofeng Liu; Qianli Feng; Yinglong Ma

arXiv:2405.00739 · cs.LG · May 3, 2024 · 1 citation

TL;DR

This paper challenges traditional views on Knowledge Distillation by showing that lower fidelity and diverse teacher attentions, promoted through data augmentation, enhance student generalization rather than strict mimicry.

Contribution

It reveals that reduced attention similarity and fidelity in ensemble KD improve generalization, offering a new perspective on optimizing knowledge transfer.

Findings

01. Decreased attention IoU correlates with reduced student overfitting.

02. Stronger data augmentation increases attention diversity among teachers.

03. Lower mutual information between teacher and student benefits generalization.
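The attention IoU in the first finding can be illustrated with a minimal sketch. It assumes each teacher's attention is available as a 2D heatmap with values in [0, 1] and that a map is binarized with a fixed threshold before comparison. The function name `attention_iou`, the 0.5 threshold, and the toy heatmaps are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence

Heatmap = Sequence[Sequence[float]]


def attention_iou(map_a: Heatmap, map_b: Heatmap, threshold: float = 0.5) -> float:
    """Intersection over Union of the regions where two heatmaps reach `threshold`.

    The threshold-based binarization is an assumption made for this sketch.
    """
    if len(map_a) != len(map_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(map_a, map_b)
    ):
        raise ValueError("attention maps must have the same shape")

    intersection = 0
    union = 0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            on_a, on_b = a >= threshold, b >= threshold
            intersection += int(on_a and on_b)
            union += int(on_a or on_b)

    # Convention chosen here: two empty attention regions count as identical.
    return intersection / union if union else 1.0


# Toy heatmaps for two hypothetical teachers looking at the same image.
teacher_a = [[0.9, 0.8, 0.1],
             [0.7, 0.2, 0.0]]
teacher_b = [[0.9, 0.1, 0.6],
             [0.8, 0.1, 0.0]]

iou = attention_iou(teacher_a, teacher_b)  # 2 shared cells / 4 active cells
```

A lower value means the two teachers attend to more distinct image regions, which is the sense of "attention diversity" used in the findings.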

Abstract

Does Knowledge Distillation (KD) really work? Conventional wisdom viewed it as a knowledge transfer procedure where a perfect mimicry of the student to its teacher is desired. However, paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization, posing questions on its possible causes. Confronted with this gap, we hypothesize that diverse attentions in teachers contribute to better student generalization at the expense of reduced fidelity in ensemble KD setups. By increasing data augmentation strengths, our key findings reveal a decrease in the Intersection over Union (IoU) of attentions between teacher models, leading to reduced student overfitting and decreased fidelity. We propose this low-fidelity phenomenon as an underlying characteristic rather than a pathology when training KD. This suggests that stronger data…
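Fidelity in an ensemble KD setup is often quantified as the rate at which the student's top-1 prediction matches that of the averaged teacher ensemble. The sketch below shows that measurement under stated assumptions: the excerpt above does not say which fidelity metric the paper uses, and the names `ensemble_mean` and `top1_agreement`, the input layout, and the toy probabilities are all hypothetical.

```python
from typing import List, Sequence

Probs = Sequence[Sequence[float]]  # probs[n][c]: probability of class c for sample n


def ensemble_mean(teacher_probs: Sequence[Probs]) -> List[List[float]]:
    """Average class probabilities over teachers; input is teacher_probs[t][n][c]."""
    n_teachers = len(teacher_probs)
    return [
        [sum(per_class) / n_teachers for per_class in zip(*per_sample)]
        for per_sample in zip(*teacher_probs)
    ]


def argmax(row: Sequence[float]) -> int:
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(student_probs: Probs, teacher_probs: Sequence[Probs]) -> float:
    """Fraction of samples where student and teacher ensemble predict the same class."""
    if not student_probs:
        raise ValueError("need at least one sample")
    ensemble = ensemble_mean(teacher_probs)
    matches = sum(argmax(s) == argmax(t) for s, t in zip(student_probs, ensemble))
    return matches / len(student_probs)


# Toy example: two teachers, three samples, two classes.
teacher_1 = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]]
teacher_2 = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
student = [[0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]

fidelity = top1_agreement(student, [teacher_1, teacher_2])  # agrees on 2 of 3 samples
```

Under this measure, a student can score below 1.0 on agreement and still generalize well, which is the situation the abstract describes as low fidelity being a characteristic rather than a pathology.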

Code & Models

Repositories

zisci2/RethinkKD (PyTorch, official)

Taxonomy

Topics: Cognitive Science and Education Research · Intelligent Tutoring Systems and Adaptive Learning

Methods: Knowledge Distillation