NormKD: Normalized Logits for Knowledge Distillation
Zhihao Chi, Tu Zheng, Hengjia Li, Zheng Yang, Boxi Wu, Binbin Lin, Deng Cai

TL;DR
NormKD introduces a sample-specific temperature adjustment in logit-based knowledge distillation, significantly improving performance on image classification tasks without extra computational costs.
Contribution
The paper proposes Normalized Knowledge Distillation (NormKD), a novel method that customizes the temperature for each sample based on its logit distribution, enhancing distillation effectiveness.
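As a rough illustration of what a per-sample temperature can look like, the sketch below scales each sample's logits by a temperature derived from that sample's own logit spread before computing the usual KL distillation loss. The specific choice here (temperature equal to a hypothetical hyper-parameter `t_norm` times the standard deviation of the sample's logits) is an assumption made for illustration; the summary above says only that the temperature is customized from the logit distribution. The function names are hypothetical, and any loss scaling factor the paper may apply is omitted.

```python
import numpy as np


def log_softmax(z, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def per_sample_temperature(logits, t_norm=1.0, eps=1e-7):
    """One temperature per sample (row).

    Assumption for this sketch: the temperature is t_norm times the
    standard deviation of the sample's logits. eps avoids division by
    zero when all logits of a sample are equal.
    """
    return t_norm * logits.std(axis=-1, keepdims=True) + eps


def normkd_style_loss(student_logits, teacher_logits, t_norm=1.0):
    """Mean KL(teacher || student) over a batch, with each sample
    softened by its own temperature instead of one global value."""
    t_student = per_sample_temperature(student_logits, t_norm)
    t_teacher = per_sample_temperature(teacher_logits, t_norm)

    log_p_teacher = log_softmax(teacher_logits / t_teacher)
    log_p_student = log_softmax(student_logits / t_student)
    p_teacher = np.exp(log_p_teacher)

    kl_per_sample = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl_per_sample.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 10)) * 5.0  # sharply peaked logits
    student = rng.normal(size=(4, 10))        # flatter logits
    print(normkd_style_loss(student, teacher))
```

Under this assumption, rescaling a sample's logits by a positive constant leaves its softened distribution essentially unchanged, so sharply peaked and flat samples are softened to a comparable degree. A single fixed temperature cannot do this for all samples at once.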
Findings
Significantly better performance on CIFAR-100 and ImageNet.
Comparable or superior results to feature-based methods.
No extra computational or storage costs.
Abstract
Logit-based knowledge distillation has received less attention in recent years, since feature-based methods perform better in most cases. Nevertheless, we find it still has untapped potential when we re-investigate the temperature, a crucial hyper-parameter used to soften the logit outputs. Most previous works set it to a fixed value for the entire distillation procedure. However, because the logits of different samples are distributed quite differently, a single temperature cannot soften all of them to an equal degree, which may cause previous methods to transfer the knowledge of each sample inadequately. In this paper, we restudy the temperature hyper-parameter and show that a single value cannot distill the knowledge from each sample sufficiently. To address this issue, we propose Normalized Knowledge Distillation (NormKD), with the…
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
