How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank

TL;DR
This paper empirically evaluates various intermediate layer distillation objectives for compressing BERT models, highlighting the effectiveness of attention transfer and the importance of layer initialization in task-specific settings.
Contribution
It provides the first comprehensive evaluation of distillation objectives in both task-specific and task-agnostic contexts, revealing insights on layer initialization impacts.
Findings
Attention transfer yields the best overall performance.
Lower-layer initialization improves task-specific distillation results.
Attention transfer remains consistent across different initializations.
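The initialization finding above can be pictured with a small sketch of how a student's layers might be mapped to teacher layers. This is an illustration only: the function name `select_teacher_layers` and the strategy labels `"lower"` and `"higher"` are hypothetical, and the paper may evaluate other layer selections than the two shown here.

```python
def select_teacher_layers(num_teacher_layers, num_student_layers, strategy="lower"):
    """Return 0-based teacher layer indices used to initialise the student.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if num_student_layers > num_teacher_layers:
        raise ValueError("student cannot have more layers than the teacher")
    if strategy == "lower":
        # Copy the bottom layers of the teacher: 0 .. num_student_layers - 1.
        return list(range(num_student_layers))
    if strategy == "higher":
        # Copy the top layers of the teacher.
        start = num_teacher_layers - num_student_layers
        return list(range(start, num_teacher_layers))
    raise ValueError(f"unknown strategy: {strategy}")
```

For a 12-layer teacher and a 6-layer student, `"lower"` selects teacher layers 0 to 5 and `"higher"` selects layers 6 to 11.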
Abstract
Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves…
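As a rough illustration of what an attention transfer objective can look like, the sketch below computes the KL divergence between teacher and student attention distributions, averaged over heads and query positions. This is one common formulation (used in MiniLM-style distillation); whether it matches the exact objective evaluated in the paper is an assumption, and the function name `attention_transfer_loss` and the `eps` parameter are hypothetical.

```python
import numpy as np

def attention_transfer_loss(teacher_attn, student_attn, eps=1e-12):
    """Mean KL(teacher || student) over heads and query positions.

    Both inputs are assumed to have shape (heads, seq_len, seq_len),
    with each row along the last axis being a probability distribution.
    """
    # Clip to avoid log(0); eps is an illustrative choice, not from the paper.
    t = np.clip(teacher_attn, eps, 1.0)
    s = np.clip(student_attn, eps, 1.0)
    # KL divergence for each (head, query position) pair.
    kl = np.sum(t * (np.log(t) - np.log(s)), axis=-1)
    return float(np.mean(kl))
```

The loss is zero when the student reproduces the teacher's attention exactly and grows as the two distributions diverge. In this sketch the teacher and student must have the same number of heads and the same sequence length.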
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Softmax · Layer Normalization · Dropout · Linear Layer · Attention Dropout
