Knowledge Distillation with Deep Supervision
Shiya Luo, Defang Chen, Can Wang

TL;DR
This paper introduces Deeply-Supervised Knowledge Distillation (DSKD), which uses the teacher's class predictions and feature maps to supervise the student's shallow layers rather than only its last layer, and balances the per-layer losses with a loss-based adaptive weighting to improve student performance.
Contribution
The paper proposes DSKD, a novel layer-wise supervision approach with adaptive loss weighting, enhancing knowledge transfer in model distillation.
Findings
Significant performance improvements on CIFAR-100 and TinyImageNet
Effective layer-wise supervision strategy demonstrated
Loss-based adaptive weight allocation balances learning across shallow layers and further improves student performance
Abstract
Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge of a pre-trained, cumbersome teacher model. However, in traditional knowledge distillation, teacher predictions provide the supervisory signal only for the last layer of the student model. As a result, the shallow student layers may lack accurate training guidance during layer-by-layer backpropagation, which hinders effective knowledge transfer. To address this issue, we propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes the class predictions and feature maps of the teacher model to supervise the training of shallow student layers. A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer and thereby further improve student performance. Extensive…
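The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the loss or weighting formulas, so everything below is an assumption made for illustration. In particular, the sketch assumes each shallow student layer has an auxiliary classifier branch that produces logits, uses the standard temperature-softened KL divergence as the per-layer distillation loss, and assumes the "loss-based" weights are a softmax over the per-layer losses (so layers that currently fit the teacher worse receive more weight). The feature-map supervision mentioned in the abstract is omitted. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def layer_kd_losses(teacher_logits, branch_logits, temperature=4.0):
    """One distillation loss per student branch (assumed auxiliary classifiers).

    The T^2 factor is the usual scaling for softened KD losses; whether DSKD
    uses it is not stated in the abstract.
    """
    p_teacher = softmax(teacher_logits, temperature)
    return [
        temperature ** 2 * kl_divergence(p_teacher, softmax(s, temperature))
        for s in branch_logits
    ]


def loss_based_weights(losses):
    """ASSUMPTION: weights are a softmax over the per-layer losses.

    In a real training loop these weights would normally be treated as
    constants (no gradient flows through them).
    """
    m = max(losses)
    exps = [math.exp(l - m) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]


def deeply_supervised_kd_loss(teacher_logits, branch_logits, temperature=4.0):
    """Weighted sum of per-layer distillation losses."""
    losses = layer_kd_losses(teacher_logits, branch_logits, temperature)
    weights = loss_based_weights(losses)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights, losses


# Toy example: one branch already matches the teacher, one does not.
teacher = [2.0, 1.0, 0.1]
branches = [[2.0, 1.0, 0.1], [0.1, 1.0, 2.0]]
total, weights, losses = deeply_supervised_kd_loss(teacher, branches)
```

In this toy example the first branch reproduces the teacher's logits, so its loss is zero and the second branch receives the larger weight. A different weighting direction (for example, down-weighting hard layers) would only require changing `loss_based_weights`.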
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Knowledge Distillation
