TL;DR
This paper proposes a knowledge distillation method in which the student mimics the teacher's penultimate-layer features instead of its soft logits. A loss based on locality-sensitive hashing (LSH) makes the student focus on matching feature direction rather than magnitude, which improves accuracy over traditional KD.
Contribution
It argues for mimicking the teacher's penultimate-layer features in knowledge distillation, decomposes each feature vector into magnitude and direction, and uses an LSH-based loss to strengthen direction matching, with reported state-of-the-art results.
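The magnitude/direction decomposition can be illustrated with a minimal NumPy sketch. The function name `decompose`, the `eps` guard, and the toy vectors are assumptions made for illustration, not taken from the paper. The example shows why a plain squared-error loss on raw features also penalizes magnitude differences, even when the student's direction already matches the teacher's.

```python
import numpy as np

def decompose(feature, eps=1e-12):
    """Split a feature vector into (magnitude, unit-length direction).

    `eps` is an illustrative guard against division by zero.
    """
    magnitude = np.linalg.norm(feature)
    direction = feature / (magnitude + eps)
    return magnitude, direction

# Toy features (hypothetical): the student has the teacher's direction
# but only half of its magnitude.
teacher = np.array([3.0, 4.0])
student = 0.5 * teacher

t_mag, t_dir = decompose(teacher)
s_mag, s_dir = decompose(student)

# Squared L2 distance on raw features is nonzero: it penalizes magnitude.
l2_gap = float(np.sum((teacher - student) ** 2))

# Cosine similarity of the directions is 1: the directions already agree.
cos_sim = float(t_dir @ s_dir)

print(f"teacher magnitude={t_mag:.2f}, student magnitude={s_mag:.2f}")
print(f"squared L2 gap={l2_gap:.2f}, cosine similarity={cos_sim:.4f}")
```

A loss that depends mostly on direction would treat this student as nearly perfect, whereas the raw L2 loss would keep pushing its magnitude toward the teacher's.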
Findings
Feature mimicking outperforms traditional KD methods.
LSH-based loss improves feature direction matching.
The method extends to multi-label recognition and object detection.
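To give a rough sense of how an LSH-style loss can emphasize direction, here is a sketch that uses sign random projections, a standard LSH family for angular similarity. It is not the paper's exact formulation, which the summary above does not spell out. The function name `lsh_loss`, the Gaussian projection matrix `W`, the zero bias `b`, the number of hash functions, and the binary cross-entropy form are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_loss(student_feat, teacher_feat, W, b):
    """Binary cross-entropy between the student's soft hash bits and the
    teacher's hard hash bits (illustrative form, not the paper's)."""
    # Teacher hash code: the sign of each random projection.
    teacher_bits = (teacher_feat @ W + b > 0).astype(np.float64)
    logits = student_feat @ W + b
    # Numerically stable BCE-with-logits:
    # max(z, 0) - z * y + log(1 + exp(-|z|))
    loss = (np.maximum(logits, 0.0) - logits * teacher_bits
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())

dim, n_hashes = 16, 256                    # hypothetical sizes
W = rng.standard_normal((dim, n_hashes))   # random hyperplanes
b = np.zeros(n_hashes)                     # zero bias: hyperplanes pass through the origin

teacher = rng.standard_normal(dim)
aligned = 2.0 * teacher    # same direction, different magnitude
opposed = -teacher         # opposite direction, same magnitude

# With zero bias, the teacher's hash bits depend only on its direction.
same_code = np.array_equal(teacher @ W > 0, aligned @ W > 0)

loss_aligned = lsh_loss(aligned, teacher, W, b)
loss_opposed = lsh_loss(opposed, teacher, W, b)
print(same_code, loss_aligned, loss_opposed)
```

In this sketch a student with the teacher's direction but a different magnitude gets a low loss, while a student with the same magnitude but the opposite direction gets a high one, which is the qualitative behavior the direction-matching finding describes.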
Abstract
Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher feature, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it can achieve higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into the magnitude and the direction. We argue that the teacher should give more freedom to the student feature's magnitude, and let the student pay more attention to mimicking the feature direction. To meet…
Taxonomy
Methods: Softmax
