Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization
Aref Jafari, Ivan Kobyzev, Mehdi Rezagholizadeh, Pascal Poupart, Ali Ghodsi

TL;DR
Continuation-KD applies continuation optimization to knowledge distillation: training starts from a smoothed version of the objective and gradually increases its complexity. This helps with the teacher-student capacity gap and with noisy teacher outputs, and improves performance on NLP and vision tasks.
Contribution
The paper proposes a training procedure for KD based on continuation optimization: the objective is smoothed at the start of training and its complexity is increased gradually as training proceeds.
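To make the idea of continuation optimization concrete, here is a minimal sketch on a one-dimensional toy problem. It is not the paper's training procedure: the objective `f`, the constants `K` and `A`, the smoothing schedule `sigmas`, and the step size are all hypothetical choices made for illustration. The sketch smooths a non-convex function with Gaussian noise (which has a closed form for this particular `f`), minimizes the smoothest version first, and warm-starts each less-smoothed stage from the previous solution.

```python
import math

K = 4.0  # frequency of the sinusoidal term (hypothetical toy value)
A = 2.0  # amplitude of the sinusoidal term (hypothetical toy value)


def f(x):
    """Toy non-convex objective: a quadratic bowl plus a sinusoid."""
    return 0.5 * x * x + A * math.sin(K * x)


def grad_smoothed(x, sigma):
    """Gradient of E[f(x + sigma * z)] for z ~ N(0, 1), in closed form.

    Gaussian smoothing damps sin(K * x) by exp(-(K * sigma)^2 / 2) and only
    adds a constant to the quadratic term. sigma = 0 recovers f'(x).
    """
    damp = math.exp(-0.5 * (K * sigma) ** 2)
    return x + A * K * math.cos(K * x) * damp


def gradient_descent(x, sigma, steps=500, lr=0.02):
    """Plain gradient descent on the objective smoothed with width sigma."""
    for _ in range(steps):
        x -= lr * grad_smoothed(x, sigma)
    return x


def continuation_descent(x0, sigmas=(1.0, 0.5, 0.25, 0.0)):
    """Solve a sequence of progressively less smoothed problems,
    warm-starting each stage from the previous stage's solution."""
    x = x0
    for sigma in sigmas:
        x = gradient_descent(x, sigma)
    return x


x_plain = gradient_descent(3.0, sigma=0.0)  # no smoothing at all
x_cont = continuation_descent(3.0)          # smoothed-to-sharp schedule
print("plain GD:     x =", x_plain, " f(x) =", f(x_plain))
print("continuation: x =", x_cont, " f(x) =", f(x_cont))
```

With these toy settings, plain gradient descent settles in a local minimum near its starting point, while the continuation schedule ends at a lower objective value. Whether a similar benefit carries over to a real distillation objective depends on how the smoothing is defined there, which this sketch does not attempt to reproduce.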
Findings
Achieves state-of-the-art results on the GLUE benchmark
Outperforms previous KD methods on CIFAR-10 and CIFAR-100
Effectively mitigates teacher noise and capacity gap issues
Abstract
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex…
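For readers unfamiliar with the teacher-student transfer the abstract refers to, the following is a minimal sketch of the standard temperature-softened distillation loss commonly used as a KD baseline. It is not the objective proposed in this paper. The function names, the `temperature` and `alpha` defaults, and the example logits are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Soft term: KL(teacher || student) on temperature-softened distributions,
    scaled by temperature^2 so its gradient magnitude stays comparable.
    Hard term: cross-entropy of the student against the true label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    cross_entropy = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * cross_entropy


# Hypothetical logits for a 3-class example.
teacher = [4.0, 1.0, -2.0]
student = [2.5, 0.5, -1.0]
print("KD loss:", kd_loss(student, teacher, label=0))
```

A teacher that is much larger than the student, or whose outputs are noisy, makes the soft-target term harder to match, which is the kind of difficulty the abstract describes.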
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
