TL;DR
This paper introduces spot-adaptive knowledge distillation (SAKD), a method that chooses which teacher layers ("spots") to distill from separately for each sample at every training iteration, improving the performance of existing distillation techniques.
Contribution
The paper proposes a novel adaptive distillation strategy that determines distillation spots per sample and epoch, improving upon fixed-spot methods and integrating seamlessly with existing distillers.
Findings
SAKD improves performance across 10 state-of-the-art distillers.
It enhances distillation in both homogeneous and heterogeneous settings.
Experimental results validate the effectiveness of adaptive spot selection.
Abstract
Knowledge distillation (KD) has become a well-established paradigm for compressing deep neural networks. The typical way of conducting knowledge distillation is to train the student network under the supervision of the teacher network to harness the knowledge at one or multiple spots (i.e., layers) in the teacher network. The distillation spots, once specified, remain the same for all training samples throughout the whole distillation process. In this work, we argue that distillation spots should be adaptive to training samples and distillation epochs. We thus propose a new distillation strategy, termed spot-adaptive KD (SAKD), to adaptively determine the distillation spots in the teacher network per sample, at every training iteration during the whole distillation period. As SAKD actually focuses on "where to distill" instead of "what to distill" that is widely investigated by…
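The difference between fixed and per-sample distillation spots can be illustrated with a small sketch. It assumes the per-spot losses have already been computed by some existing distiller, and it takes the selection mask as an input; the excerpt above does not say how SAKD chooses spots, so the function name `spot_adaptive_loss`, the mask values, and the loss values are all hypothetical.

```python
from typing import List


def spot_adaptive_loss(per_spot_losses: List[List[float]],
                       spot_mask: List[List[int]]) -> float:
    """Batch-averaged distillation loss, counting only the selected spots.

    per_spot_losses[i][k]: loss of sample i at candidate spot k
                           (illustrative values, not from the paper).
    spot_mask[i][k]:       1 if spot k is used for sample i at this
                           iteration, 0 otherwise (hypothetical input;
                           the selection rule itself is not modeled here).
    """
    if len(per_spot_losses) != len(spot_mask):
        raise ValueError("losses and mask must cover the same samples")
    total = 0.0
    for losses, mask in zip(per_spot_losses, spot_mask):
        if len(losses) != len(mask):
            raise ValueError("losses and mask must cover the same spots")
        # Only spots switched on for this sample contribute to its loss.
        total += sum(l * m for l, m in zip(losses, mask))
    return total / len(per_spot_losses)


# Two samples, three candidate spots (layers) in the teacher.
losses = [[0.5, 1.0, 2.0],
          [1.5, 0.5, 1.0]]

# Fixed-spot distillation: every sample uses the same spots.
fixed_mask = [[1, 1, 1],
              [1, 1, 1]]

# Spot-adaptive distillation: each sample gets its own subset of spots.
adaptive_mask = [[1, 0, 1],
                 [0, 1, 1]]

fixed_loss = spot_adaptive_loss(losses, fixed_mask)
adaptive_loss = spot_adaptive_loss(losses, adaptive_mask)
```

In this framing, a fixed-spot method is the special case where every row of the mask is identical and never changes during training, while a spot-adaptive method recomputes the mask for each sample at each iteration.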
Taxonomy
Methods: Knowledge Distillation
