Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism
Chenqi Guo, Shiwei Zhong, Xiaofeng Liu, Qianli Feng, Yinglong Ma

TL;DR
This paper challenges traditional views on Knowledge Distillation by showing that lower fidelity and diverse teacher attentions, promoted through data augmentation, enhance student generalization rather than strict mimicry.
Contribution
It reveals that reduced attention similarity and fidelity in ensemble KD improve generalization, offering a new perspective on optimizing knowledge transfer.
Findings
Decreased attention IoU correlates with reduced student overfitting.
Stronger data augmentation increases attention diversity among teachers.
Lower mutual information between teacher and student benefits generalization.
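To make the attention IoU in these findings concrete, here is a minimal sketch of how the overlap between two attention maps could be scored. It assumes each map is a 2D saliency array (for example, a Grad-CAM-style heatmap) that is min-max normalized and binarized at a fixed threshold. The function name `attention_iou` and the threshold of 0.5 are illustrative choices, not details taken from the paper.

```python
import numpy as np


def attention_iou(map_a, map_b, threshold=0.5):
    """IoU of two attention maps after min-max normalization and thresholding.

    Hypothetical helper: the binarization rule is an assumption for illustration.
    """
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("attention maps must have the same shape")

    def normalize(m):
        # Scale to [0, 1]; a constant map carries no attention signal.
        span = m.max() - m.min()
        return (m - m.min()) / span if span > 0 else np.zeros_like(m)

    mask_a = normalize(a) >= threshold
    mask_b = normalize(b) >= threshold

    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0  # neither map highlights any region
    return float(np.logical_and(mask_a, mask_b).sum() / union)


# Two toy 2x2 maps from two hypothetical teachers.
teacher_1 = [[1.0, 1.0], [0.0, 0.0]]
teacher_2 = [[1.0, 0.0], [0.0, 0.0]]
print(attention_iou(teacher_1, teacher_2))
```

Under this definition, a lower score means the two teachers attend to more distinct regions of the input, which is the sense of "attention diversity" used above.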
Abstract
Does Knowledge Distillation (KD) really work? Conventional wisdom viewed it as a knowledge transfer procedure where a perfect mimicry of the student to its teacher is desired. However, paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization, posing questions on its possible causes. Confronted with this gap, we hypothesize that diverse attentions in teachers contribute to better student generalization at the expense of reduced fidelity in ensemble KD setups. By increasing data augmentation strengths, our key findings reveal a decrease in the Intersection over Union (IoU) of attentions between teacher models, leading to reduced student overfitting and decreased fidelity. We propose this low-fidelity phenomenon as an underlying characteristic rather than a pathology when training KD. This suggests that stronger data…
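As a rough illustration of "fidelity" in an ensemble KD setup, the sketch below averages the softened predictions of several teachers and measures how often a student's top-1 prediction matches the ensemble's. Top-1 agreement is one common way to quantify fidelity; whether the paper uses exactly this metric is an assumption, and the names `ensemble_probs` and `top1_agreement` are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_probs(teacher_logits, temperature=1.0):
    """Average per-teacher probabilities.

    teacher_logits has shape (num_teachers, num_samples, num_classes).
    """
    return softmax(teacher_logits, temperature).mean(axis=0)


def top1_agreement(teacher_probs, student_logits):
    """Fraction of samples where student and teacher ensemble pick the same class."""
    teacher_pred = np.asarray(teacher_probs).argmax(axis=-1)
    student_pred = np.asarray(student_logits).argmax(axis=-1)
    return float(np.mean(teacher_pred == student_pred))


# Two hypothetical teachers, two samples, two classes.
teacher_logits = np.array([
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
student_logits = np.array([[1.0, 0.0], [1.0, 0.0]])

probs = ensemble_probs(teacher_logits)
print(top1_agreement(probs, student_logits))
```

In this framing, the abstract's claim is that a student can score lower on such an agreement measure and still generalize better.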
Taxonomy
Topics: Cognitive Science and Education Research · Intelligent Tutoring Systems and Adaptive Learning
Methods: Knowledge Distillation
