Knowledge Distillation from A Stronger Teacher
Tao Huang, Shan You, Fei Wang, Chen Qian, Chang Xu

TL;DR
This paper introduces DIST, a knowledge distillation method that learns from stronger teachers by preserving the inter-class and intra-class relations in the teacher's predictions rather than matching those predictions exactly, leading to improved performance across multiple vision tasks.
Contribution
Proposes a novel correlation-based relational loss for distillation from stronger teachers, addressing prediction discrepancy issues in existing methods.
Findings
Achieves state-of-the-art results on image classification, object detection, and segmentation.
Effective across various architectures and training strategies.
Improves student performance by capturing intrinsic inter-class and intra-class relations.
Abstract
Unlike existing knowledge distillation methods, which focus on baseline settings where the teacher models and training strategies are not as strong or competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be considerably more severe. As a result, exactly matching predictions via KL divergence disturbs training and makes existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of teacher and student suffices, and propose a correlation-based loss to capture the intrinsic inter-class relations from the teacher explicitly. Besides, considering that different instances have different semantic similarities to each class, we also extend this…
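The idea of replacing an exact prediction match with a correlation-based relation can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes Pearson correlation as the correlation measure, assumes a softmax temperature parameter, and uses hypothetical function names (`pearson`, `relation_loss`, `correlation_distill_loss`). Because the abstract above is truncated, the intra-class term is inferred from the Findings bullet on inter-class and intra-class relations and is likewise an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    # Numerically stable softmax over one row of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pearson(a, b, eps=1e-8):
    # Pearson correlation (assumed measure): cosine similarity of centered vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    cov = sum(x * y for x, y in zip(ca, cb))
    norm = math.sqrt(sum(x * x for x in ca)) * math.sqrt(sum(y * y for y in cb))
    return cov / (norm + eps)


def relation_loss(student_rows, teacher_rows):
    # Mean of (1 - correlation) over paired rows; 0 when relations match perfectly.
    pairs = list(zip(student_rows, teacher_rows))
    return sum(1.0 - pearson(s, t) for s, t in pairs) / len(pairs)


def transpose(rows):
    return [list(col) for col in zip(*rows)]


def correlation_distill_loss(student_logits, teacher_logits, temperature=1.0):
    # Hypothetical helper: returns (inter-class term, intra-class term).
    s = [softmax(z, temperature) for z in student_logits]
    t = [softmax(z, temperature) for z in teacher_logits]
    inter = relation_loss(s, t)                        # per instance, across classes
    intra = relation_loss(transpose(s), transpose(t))  # per class, across instances
    return inter, intra


if __name__ == "__main__":
    teacher = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [0.2, 0.4, 3.0]]
    student = [[1.0, 0.6, 0.2], [0.1, 1.2, 0.3], [0.3, 0.2, 1.5]]
    print(correlation_distill_loss(student, teacher))
```

Under this sketch, the loss depends only on how the student's probabilities vary relative to the teacher's, not on their absolute values: a positive rescaling or constant shift of one probability vector leaves its Pearson correlation with the other unchanged, whereas a KL-based match would penalize it. How the two terms are weighted and combined with the usual classification loss is not stated in the text above and is left open here.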
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Knowledge Distillation
