Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization

Aref Jafari; Ivan Kobyzev; Mehdi Rezagholizadeh; Pascal Poupart; Ali Ghodsi

arXiv:2212.05998 · cs.LG · December 13, 2022 · 1 citation

TL;DR

Continuation-KD introduces a continuation optimization approach to knowledge distillation, progressively refining the training process to handle capacity gaps and noisy teacher outputs, leading to improved performance in NLP and vision tasks.

Contribution

The paper proposes a training procedure for KD based on continuation optimization: training starts from a smoothed version of the KD objective and gradually increases its complexity, which the authors report makes distillation more effective.
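As a rough illustration of the general continuation idea (not the paper's actual training procedure or loss), the sketch below minimizes a one-dimensional non-convex function by first solving an easy convex surrogate and then blending toward the target in stages, warm-starting each stage from the previous solution. The functions `f` and `g`, the linear schedule `t = k / n_stages`, the step size, and all helper names are assumptions chosen for this toy example.

```python
"""Toy continuation (homotopy) optimization in one dimension."""


def f(x):
    # Hypothetical non-convex target with two local minima.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1


def grad_g(x):
    # Derivative of the convex surrogate g(x) = x**2 (assumed for illustration).
    return 2 * x


def gradient_descent(grad, x, lr=0.01, steps=500):
    # Plain fixed-step gradient descent on a scalar.
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def continuation_minimize(x0, n_stages=10):
    # Minimize h(x, t) = (1 - t) * g(x) + t * f(x) for t = 0, 1/n, ..., 1,
    # warm-starting each stage from the previous stage's solution.
    x = x0
    for k in range(n_stages + 1):
        t = k / n_stages
        x = gradient_descent(
            lambda z, t=t: (1 - t) * grad_g(z) + t * grad_f(z), x
        )
    return x


if __name__ == "__main__":
    x_plain = gradient_descent(grad_f, 2.0)
    x_cont = continuation_minimize(2.0)
    print(f"plain GD:     x = {x_plain:.4f}, f(x) = {f(x_plain):.4f}")
    print(f"continuation: x = {x_cont:.4f}, f(x) = {f(x_cont):.4f}")
```

In this toy setup, plain gradient descent started at x = 2 settles in the nearer, shallower minimum, while the staged schedule ends in the deeper one. That outcome depends on the chosen functions and schedule and is not guaranteed in general.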

Findings

1. Achieves state-of-the-art results on the GLUE benchmark.
2. Outperforms previous KD methods on CIFAR-10 and CIFAR-100.
3. Mitigates teacher noise and the teacher-student capacity gap.

Abstract

Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning