BERT Learns to Teach: Knowledge Distillation with Meta Learning

Wangchunshu Zhou; Canwen Xu; Julian McAuley

arXiv:2106.04570·cs.LG·April 5, 2022

TL;DR

This paper introduces MetaDistil, a meta learning approach in which the teacher model learns to transfer knowledge to the student more effectively. It outperforms traditional KD methods with a fixed teacher and is less sensitive to student capacity and hyperparameters.

Contribution

The paper proposes a meta learning framework for knowledge distillation in which the teacher learns to teach from feedback on the distilled student's performance, along with a pilot update mechanism that improves the alignment between the inner-learner and the meta-learner.

Findings

1. MetaDistil outperforms traditional KD methods on various benchmarks.
2. It is less sensitive to student capacity and hyperparameters.
3. The approach facilitates knowledge distillation across diverse tasks and models.

Abstract

We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with the feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of different student capacity and hyperparameters, facilitating the use of KD on different tasks and models.
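
The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes 1-D logistic models for both teacher and student, a held-out "quiz" batch as the feedback signal, and a finite-difference approximation of the meta-gradient in place of backpropagation through the student update. The pilot update is read here as a trial student step that is used only to update the teacher and is then discarded, after which the real student is updated with the new teacher; that reading, and every name and hyperparameter below, is an assumption made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, x):
    """Probability of class 1 from a 1-D logistic model (weight, bias)."""
    w, b = params
    return sigmoid(w * x + b)

def hard_loss(params, batch):
    """Mean binary cross-entropy against the true labels."""
    eps = 1e-12
    total = 0.0
    for x, y in batch:
        p = predict(params, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(batch)

def student_step(student, teacher, batch, lr=0.5, alpha=0.5):
    """One gradient step on alpha * CE(labels) + (1 - alpha) * CE(teacher probs)."""
    gw = gb = 0.0
    for x, y in batch:
        p_s, p_t = predict(student, x), predict(teacher, x)
        # Derivative of the combined loss with respect to the student logit.
        g = alpha * (p_s - y) + (1 - alpha) * (p_s - p_t)
        gw += g * x
        gb += g
    n = len(batch)
    return (student[0] - lr * gw / n, student[1] - lr * gb / n)

def meta_grad(student, teacher, train_batch, quiz_batch, h=1e-4):
    """Central-difference estimate of d(quiz loss of pilot student) / d(teacher)."""
    def quiz_loss_after_pilot(t):
        # Pilot update: a trial step on a copy; `student` itself is untouched.
        pilot = student_step(student, t, train_batch)
        return hard_loss(pilot, quiz_batch)
    grads = []
    for i in range(len(teacher)):
        up, down = list(teacher), list(teacher)
        up[i] += h
        down[i] -= h
        grads.append((quiz_loss_after_pilot(up) - quiz_loss_after_pilot(down)) / (2 * h))
    return grads

def metadistil_step(student, teacher, train_batch, quiz_batch, teacher_lr=0.5):
    """Update the teacher from quiz feedback, then the real student (assumed order)."""
    g = meta_grad(student, teacher, train_batch, quiz_batch)
    teacher = (teacher[0] - teacher_lr * g[0], teacher[1] - teacher_lr * g[1])
    student = student_step(student, teacher, train_batch)
    return student, teacher

def run_demo(steps=50):
    # Hypothetical toy data: label is 1 when x > 0.
    train_batch = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
    quiz_batch = [(-1.5, 0), (-0.5, 0), (0.5, 1), (1.5, 1)]
    student, teacher = (0.0, 0.0), (1.0, 0.0)
    initial_loss = hard_loss(student, quiz_batch)
    for _ in range(steps):
        student, teacher = metadistil_step(student, teacher, train_batch, quiz_batch)
    return initial_loss, hard_loss(student, quiz_batch), teacher

if __name__ == "__main__":
    before, after, teacher = run_demo()
    print(f"student quiz loss: {before:.3f} -> {after:.3f}; teacher: {teacher}")
```

The contrast with traditional KD is in `metadistil_step`: with a fixed teacher, only the last `student_step` call would remain. A practical implementation would differentiate through the student update with automatic differentiation rather than finite differences, which do not scale beyond a handful of teacher parameters.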

Code & Models

Repositories

JetRunner/MetaDistil (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Machine Learning and ELM

Methods: Knowledge Distillation