Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

Yuanxin Liu; Fandong Meng; Zheng Lin; Weiping Wang; Jie Zhou

arXiv:2106.05691·cs.CL·June 11, 2021

TL;DR

This paper investigates the diminishing returns of hidden state knowledge distillation in BERT compression, proposing an efficient method that uses minimal knowledge to achieve comparable performance and significantly speeds up training.

Contribution

It shows that the marginal utility of distilling BERT's hidden states diminishes as more of them are distilled, and introduces a KD paradigm that remains effective while being more efficient by distilling only the crucial knowledge.

Findings

01. Performance gains diminish quickly as more hidden state knowledge is distilled.

02. A small fraction of hidden state knowledge suffices for comparable performance.

03. The proposed KD method accelerates training by 2.7x to 3.4x.

Abstract

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft label as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single…
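As a rough illustration of the layer-wise hidden state distillation the abstract describes, the Python sketch below computes a mean squared error between teacher and student hidden states, restricted along depth (which layers), length (which tokens), and width (which feature dimensions). It is not taken from the paper or its repository. The function names, the MSE objective, the index-based selection, and the assumption that teacher and student share the same hidden size are all illustrative choices.

```python
import numpy as np


def hsk_mse_loss(teacher_hs, student_hs, layer_pairs, token_idx=None, width=None):
    """Hypothetical HSK loss: MSE over a selected subset of hidden states.

    teacher_hs:  array of shape (teacher_layers, tokens, hidden)
    student_hs:  array of shape (student_layers, tokens, hidden)
    layer_pairs: list of (student_layer, teacher_layer) index pairs (depth)
    token_idx:   token positions to keep (length); None keeps all tokens
    width:       number of leading feature dimensions to keep; None keeps all
    """
    n_tokens, hidden = teacher_hs.shape[1], teacher_hs.shape[2]
    if token_idx is None:
        token_idx = np.arange(n_tokens)
    if width is None:
        width = hidden

    per_layer = []
    for s_layer, t_layer in layer_pairs:
        # Slice the same tokens and feature dimensions from both models.
        t = teacher_hs[t_layer][token_idx, :width]
        s = student_hs[s_layer][token_idx, :width]
        per_layer.append(np.mean((s - t) ** 2))
    return float(np.mean(per_layer))


def hsk_fraction(teacher_shape, layer_pairs, token_idx=None, width=None):
    """Fraction of the teacher's hidden state values used as targets."""
    n_layers, n_tokens, hidden = teacher_shape
    kept_tokens = n_tokens if token_idx is None else len(token_idx)
    kept_width = hidden if width is None else width
    kept = len(layer_pairs) * kept_tokens * kept_width
    return kept / (n_layers * n_tokens * hidden)


# Toy data: a 12-layer teacher and a 4-layer student, 8 tokens, hidden size 16.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(12, 8, 16))
student = rng.normal(size=(4, 8, 16))

# Extensive setting: every student layer mimics a teacher layer on all tokens.
full_pairs = [(0, 2), (1, 5), (2, 8), (3, 11)]
full_loss = hsk_mse_loss(teacher, student, full_pairs)

# Reduced setting: last layer only, first token only, first 4 features only.
small_loss = hsk_mse_loss(teacher, student, [(3, 11)], token_idx=[0], width=4)
small_frac = hsk_fraction(teacher.shape, [(3, 11)], token_idx=[0], width=4)
```

In this toy setup the reduced setting uses 4 of the teacher's 1536 hidden state values as targets. Which layers, tokens, and features are actually worth keeping is the question the paper studies, and the excerpt above does not specify the strategies it settles on.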

Code & Models

Repositories

llyx97/Marginal-Utility-Diminishes (PyTorch, official)

Taxonomy

Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Data Stream Mining Techniques

Methods: Multi-Head Attention · Linear Layer · Knowledge Distillation · Attention Is All You Need · Adam · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Attention Dropout