Knowledge Distillation with Deep Supervision

Shiya Luo; Defang Chen; Can Wang

arXiv:2202.07846·cs.LG·May 26, 2023

TL;DR

This paper introduces Deeply-Supervised Knowledge Distillation (DSKD), which uses the teacher's class predictions and feature maps to supervise the student's shallow layers rather than only its last layer, and balances the per-layer losses with a loss-based adaptive weighting to improve student performance.

Contribution

The paper proposes DSKD, a layer-wise supervision approach with adaptive loss weighting that improves knowledge transfer from the teacher to the student's shallow layers.

Findings

01 Significant performance improvements on CIFAR-100 and TinyImageNet

02 Effective layer-wise supervision strategy demonstrated

03 Adaptive weight allocation enhances training efficiency

Abstract

Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge from a pre-trained cumbersome teacher model. However, in traditional knowledge distillation, teacher predictions are used only to provide the supervisory signal for the last layer of the student model. As a result, the shallow student layers may lack accurate training guidance during layer-by-layer backpropagation, which hinders effective knowledge transfer. To address this issue, we propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class predictions and feature maps of the teacher model to supervise the training of shallow student layers. A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve the student performance. Extensive…
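The idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's formulation: the function names are hypothetical, the shallow layers are assumed to have auxiliary classifier heads that emit logits, the weighting rule (each shallow layer weighted by its share of the total shallow loss) is only one plausible reading of "loss-based weight allocation", and the feature-map term is omitted entirely.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened predictions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def loss_based_weights(losses):
    """Hypothetical rule: weight each layer by its share of the total loss."""
    total = sum(losses)
    if total == 0:
        return [1.0 / len(losses)] * len(losses)
    return [loss / total for loss in losses]


def deeply_supervised_loss(shallow_logits, final_logits, teacher_logits,
                           temperature=4.0):
    """Last-layer KD term plus adaptively weighted shallow-layer KD terms."""
    final_term = kd_loss(final_logits, teacher_logits, temperature)
    shallow_terms = [kd_loss(s, teacher_logits, temperature)
                     for s in shallow_logits]
    weights = loss_based_weights(shallow_terms)
    weighted = sum(w * t for w, t in zip(weights, shallow_terms))
    return final_term + weighted
```

Under this assumed rule, a shallow layer whose predictions are further from the teacher's receives a larger weight, so the layers that lag the most get the strongest guidance. In an actual training loop the weights would normally be treated as constants when computing gradients.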

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification

Methods: Knowledge Distillation