Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation
Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, Jie Zhou

TL;DR
This paper investigates the diminishing returns of hidden state knowledge distillation in BERT compression and proposes an efficient method that distills only a small fraction of that knowledge, achieving comparable performance while speeding up training by 2.7x to 3.4x.
Contribution
It reveals the diminishing marginal utility of distilling all hidden states in BERT and introduces a new KD paradigm that is both efficient and effective by focusing on crucial knowledge.
Findings
Distilling all hidden states yields diminishing performance gains.
A small fraction of hidden state knowledge suffices to reach performance comparable to distilling far more of it.
The proposed KD method accelerates training by 2.7x to 3.4x.
Abstract
Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft label as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single…
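The three HSK dimensions named in the abstract can be pictured as axes of a (layers, tokens, features) array. The following is a minimal sketch, not the paper's method: it assumes the student's hidden states have already been mapped to the teacher's shape, and the selection rule (explicit layer and token indices, leading feature dimensions) and the names `select_hsk`, `hsk_mse`, and `kept_fraction` are hypothetical choices made only for illustration.

```python
import numpy as np

def select_hsk(hidden, layer_idx, token_idx, width):
    """Keep a sub-block of hidden states along depth, length and width.

    hidden: array of shape (num_layers, seq_len, hidden_size).
    layer_idx, token_idx: indices kept along depth and length.
    width: number of leading feature dimensions kept (an assumed,
           deliberately simple width-compression rule).
    """
    return hidden[np.ix_(layer_idx, token_idx, np.arange(width))]

def hsk_mse(teacher, student, layer_idx, token_idx, width):
    """Mean squared error between teacher and student on the kept sub-block.

    Assumes both arrays share the same shape, i.e. any layer mapping or
    linear projection of the student has already been applied.
    """
    t = select_hsk(teacher, layer_idx, token_idx, width)
    s = select_hsk(student, layer_idx, token_idx, width)
    return float(np.mean((t - s) ** 2))

def kept_fraction(shape, layer_idx, token_idx, width):
    """Fraction of all hidden state entries that the selection retains."""
    num_layers, seq_len, hidden_size = shape
    kept = len(layer_idx) * len(token_idx) * width
    return kept / (num_layers * seq_len * hidden_size)
```

Calling `kept_fraction` with different index sets makes it easy to see how quickly the amount of distilled HSK shrinks when all three dimensions are compressed at once, which is the quantity the abstract relates to the diminishing performance gain.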
Taxonomy
Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Data Stream Mining Techniques
Methods: Multi-Head Attention · Linear Layer · Knowledge Distillation · Attention Is All You Need · Adam · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Attention Dropout
