Annealing Knowledge Distillation
Aref Jafari, Mehdi Rezagholizadeh, Pranav Sharma, Ali Ghodsi

TL;DR
This paper introduces Annealing-KD, a novel knowledge distillation method that incrementally transfers rich soft-target information from teacher to student, improving training efficiency and performance on various tasks.
Contribution
The paper proposes an annealing-based approach that transfers knowledge gradually, addressing the difficulty of training a student with knowledge distillation when the gap between teacher and student is large.
Findings
Annealing-KD consistently outperforms traditional KD on image classification tasks.
It also achieves superior results on NLP benchmarks with BERT models.
The paper validates the effectiveness of annealed soft-targets both theoretically and empirically.
Abstract
Significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the knowledge of a trained large teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits the soft-target information (also known as "dark knowledge") besides the given regular hard labels in a training set. However, it is shown in the literature that the larger the gap between the teacher and the student networks, the more difficult is their training using knowledge distillation. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets…
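As a rough illustration of the soft-target objective the abstract refers to, the sketch below implements the classic temperature-based distillation loss for a single example. It is not the Annealing-KD procedure itself, whose details are cut off in the excerpt above. The names `softmax` and `kd_loss`, the `temperature` and `alpha` defaults, and the example logits are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Classic KD loss for one example (illustrative; not Annealing-KD itself)."""
    # Hard-label term: cross-entropy against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL(teacher || student), both softened at the same temperature.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0)
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    teacher = [6.0, 2.0, -1.0]  # hypothetical teacher logits
    student = [2.0, 1.0, 0.5]   # hypothetical student logits
    for T in (1.0, 2.0, 4.0):
        # Higher temperatures expose more of the teacher's "dark knowledge".
        print(T, [round(p, 3) for p in softmax(teacher, T)])
    print(kd_loss(student, teacher, label=0))
```

Raising the temperature spreads probability mass onto the non-target classes, which is the soft-target information the student learns from in addition to the hard label.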
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
