BERT Learns to Teach: Knowledge Distillation with Meta Learning
Wangchunshu Zhou, Canwen Xu, Julian McAuley

TL;DR
This paper introduces MetaDistil, a meta learning approach in which the teacher model learns to improve knowledge transfer to the student. It outperforms traditional KD methods and is less sensitive to student capacity and hyperparameters.
Contribution
The paper proposes a meta learning framework for knowledge distillation in which the teacher learns to teach, along with a pilot update mechanism that improves the alignment between the inner-learner and the meta-learner.
Findings
MetaDistil outperforms traditional KD methods on various benchmarks.
It is less sensitive to student capacity and hyperparameters.
The approach facilitates knowledge distillation across diverse tasks and models.
Abstract
We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with the feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of different student capacity and hyperparameters, facilitating the use of KD on different tasks and models.
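The loop the abstract describes can be illustrated with a toy sketch. The abstract does not spell out the update order, so the three steps below (a trial "pilot" update of the student, a teacher update driven by the pilot student's loss on held-out "quiz" data, then the real student update on the same batch) are one plausible reading and should be treated as an assumption. The scalar linear models, the squared-error losses, and every name here (student_step, meta_distil_step, train, quiz_x, quiz_y) are hypothetical and chosen only so the meta-gradient can be written out by hand. This is not the authors' implementation.

```python
def student_step(student_w, teacher_w, x, lr):
    """One gradient step of the student on a toy distillation loss.

    Toy loss (an assumption): (student_w * x - teacher_w * x) ** 2.
    The returned weight depends on teacher_w, which is what lets the
    teacher receive feedback from the updated student.
    """
    grad = 2.0 * x * x * (student_w - teacher_w)  # d loss / d student_w
    return student_w - lr * grad


def meta_distil_step(student_w, teacher_w, train_x, quiz_x, quiz_y, lr, meta_lr):
    """One 'learning to teach' iteration on scalar models (illustrative only)."""
    # 1) Pilot update: a trial step that leaves the real student untouched.
    pilot_w = student_step(student_w, teacher_w, train_x, lr)

    # 2) Teacher update: score the pilot student on held-out quiz data and
    #    push the quiz-loss gradient back to the teacher via the chain rule.
    quiz_err = pilot_w * quiz_x - quiz_y
    dloss_dpilot = 2.0 * quiz_err * quiz_x
    dpilot_dteacher = lr * 2.0 * train_x * train_x
    teacher_w = teacher_w - meta_lr * dloss_dpilot * dpilot_dteacher

    # 3) Real update: the student learns from the updated teacher on the
    #    same training batch that the pilot step used.
    student_w = student_step(student_w, teacher_w, train_x, lr)
    return student_w, teacher_w


def train(steps=200, lr=0.1, meta_lr=0.5):
    """Run the toy loop; quiz labels follow y = 3 * x, the teacher starts at 2."""
    student_w, teacher_w = 0.0, 2.0
    for _ in range(steps):
        student_w, teacher_w = meta_distil_step(
            student_w, teacher_w,
            train_x=1.0, quiz_x=1.0, quiz_y=3.0,
            lr=lr, meta_lr=meta_lr,
        )
    return student_w, teacher_w


if __name__ == "__main__":
    print(train())
```

In this toy setting a fixed teacher at 2.0 would pull the student toward 2.0, while the teacher that receives quiz feedback drifts toward the value that serves the student best. A realistic version would replace the hand-derived chain rule with automatic differentiation through the pilot step.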
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Machine Learning and ELM
Methods: Knowledge Distillation
