Understanding the Effects of Projectors in Knowledge Distillation
Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, Brano Kusy, Zi Huang

TL;DR
This paper investigates the often-overlooked role of projectors in knowledge distillation, revealing their benefits even when feature dimensions match and proposing an ensemble method to enhance distillation performance.
Contribution
It uncovers the positive effects of projectors in knowledge distillation and introduces a projector ensemble approach for improved student model performance.
Findings
Students with projectors achieve better accuracy trade-offs.
Projectors help preserve teacher-student similarity beyond numeric metrics.
The proposed ensemble method outperforms baseline distillation techniques.
Abstract
Conventionally, during the knowledge distillation process (e.g. feature distillation), an additional projector is often required to perform feature transformation due to the dimension mismatch between the teacher and the student networks. Interestingly, we discovered that even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance. In addition, projectors even improve logit distillation if we add them to the architecture too. Inspired by these surprising findings and the general lack of understanding of projectors in the knowledge distillation process in the existing literature, this paper investigates the implicit role that projectors play, which has so far been overlooked. Our empirical study shows that the student with a projector (1) obtains a better trade-off between the training accuracy and the…
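To make the idea of a projector ensemble concrete, here is a minimal NumPy sketch of feature distillation in which several projectors transform the student features and their outputs are averaged before being compared with the teacher features. The text above does not specify the projector architecture, how the ensemble is combined, or the loss, so the linear projectors, the averaging, the mean-squared-error loss, and all function names below are assumptions made for illustration only.

```python
import numpy as np


def make_projectors(num_projectors, dim_student, dim_teacher, seed=0):
    """Create linear projectors with different random initialisations.

    Assumption: each projector is a single weight matrix (no bias, no
    non-linearity); the source does not describe the architecture.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim_student)
    return [
        rng.normal(0.0, scale, size=(dim_student, dim_teacher))
        for _ in range(num_projectors)
    ]


def project_ensemble(student_feats, projectors):
    """Project student features with every projector and average the outputs.

    Assumption: the ensemble is combined by a plain average.
    """
    outputs = [student_feats @ w for w in projectors]
    return np.mean(outputs, axis=0)


def feature_distillation_loss(student_feats, teacher_feats, projectors):
    """Mean squared error between projected student and teacher features.

    Assumption: MSE is a stand-in; the source does not name the loss.
    """
    projected = project_ensemble(student_feats, projectors)
    return float(np.mean((projected - teacher_feats) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Student and teacher share the same feature dimension (64) here,
    # the setting in which the abstract reports projectors still help.
    student = rng.normal(size=(8, 64))
    teacher = rng.normal(size=(8, 64))
    projectors = make_projectors(num_projectors=3, dim_student=64, dim_teacher=64)
    print("distillation loss:", feature_distillation_loss(student, teacher, projectors))
```

In a real training loop the projector weights would be learned jointly with the student and the loss would be added to the task loss; this sketch only shows the forward computation.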
Taxonomy
Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Neural Networks and Applications
Methods: Knowledge Distillation
