Annealing Knowledge Distillation

Aref Jafari; Mehdi Rezagholizadeh; Pranav Sharma; Ali Ghodsi

arXiv:2104.07163 · cs.CL · April 16, 2021

TL;DR

This paper introduces Annealing-KD, a knowledge distillation method that transfers the teacher's rich soft-target information to the student incrementally rather than all at once, improving training efficiency and performance on image classification and NLP tasks.

Contribution

The paper proposes an annealing-based approach that transfers the teacher's knowledge to the student gradually, addressing the difficulty of knowledge distillation when the gap between the teacher and student networks is large.

Findings

1. Consistently outperforms traditional KD on image classification tasks.

2. Achieves superior results on NLP benchmarks with BERT models.

3. Theoretically and empirically validates the effectiveness of annealed soft-targets.

Abstract

Significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the knowledge of a trained large teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits the soft-target information (also known as "dark knowledge") besides the given regular hard labels in a training set. However, it is shown in the literature that the larger the gap between the teacher and the student networks, the more difficult is their training using knowledge distillation. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets…
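As a point of reference for the training objective the abstract mentions, the following is a minimal sketch of the conventional KD loss, which combines a hard-label term with a soft-target term. It is not the Annealing-KD objective, whose details are cut off in the abstract above. The function names and the `temperature` and `alpha` parameters are illustrative assumptions, and the choice of cross-entropy scaled by the squared temperature follows common practice rather than anything stated on this page.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Conventional KD loss (assumed form): hard-label term plus soft-target term.

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    soft-target cross-entropy, scaled by temperature**2 as is commonly done.
    """
    num_classes = len(student_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
    hard_term = cross_entropy(one_hot, softmax(student_logits))
    soft_term = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * hard_term + (1.0 - alpha) * (temperature ** 2) * soft_term

# Hypothetical logits for a three-class problem.
teacher = [3.0, 1.0, 0.2]
matching_student = [3.0, 1.0, 0.2]
mismatched_student = [0.2, 1.0, 3.0]
print(kd_loss(matching_student, teacher, label=0))
print(kd_loss(mismatched_student, teacher, label=0))
```

In this sketch the soft-target term is what carries the "dark knowledge": the teacher's temperature-softened distribution tells the student how the wrong classes rank relative to each other, which the one-hot label cannot express.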

Code & Models

Repositories

huawei-noah/KD-NLP/tree/main/Annealing_KD (PyTorch)

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning

Methods: Knowledge Distillation