How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

Xinpeng Wang; Leonie Weissweiler; Hinrich Schütze; Barbara Plank

arXiv:2305.15032 · cs.CL · May 25, 2023 · 1 citation

TL;DR

This paper empirically evaluates various intermediate layer distillation objectives for compressing BERT models, highlighting the effectiveness of attention transfer and the importance of layer initialization in task-specific settings.

Contribution

To the authors' knowledge, this is the first comprehensive evaluation of distillation objectives in both task-specific and task-agnostic settings. It also shows how the choice of teacher layers used to initialise the student affects performance.

Findings

01

Attention transfer yields the best overall performance.

02

For vanilla KD and hidden states transfer, initialising the student from the teacher's lower layers improves task-specific distillation results.

03

Attention transfer remains consistent across different initializations.
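
The lower-layer versus higher-layer initialisation compared in the findings can be illustrated with a minimal sketch. It is not taken from the paper's repository: the function names `select_teacher_layers` and `remap_state_dict` are hypothetical, the 12-layer teacher and 6-layer student are example sizes, and the `encoder.layer.<i>.` parameter naming is an assumed convention.

```python
import re

def select_teacher_layers(num_teacher_layers, num_student_layers, strategy="lower"):
    """Return the teacher layer indices used to initialise the student.

    "lower" takes the first layers of the teacher, "higher" the last ones.
    Both strategy names are illustrative labels, not the paper's API.
    """
    if num_student_layers > num_teacher_layers:
        raise ValueError("student cannot have more layers than the teacher")
    if strategy == "lower":
        return list(range(num_student_layers))
    if strategy == "higher":
        start = num_teacher_layers - num_student_layers
        return list(range(start, num_teacher_layers))
    raise ValueError(f"unknown strategy: {strategy!r}")

# Assumed parameter naming: "<prefix>encoder.layer.<index>.<rest>"
_LAYER_KEY = re.compile(r"^(.*encoder\.layer\.)(\d+)(\..*)$")

def remap_state_dict(teacher_state, teacher_layers):
    """Build a student parameter dict from the selected teacher layers.

    Parameters outside the encoder stack (e.g. embeddings) are copied as-is;
    selected layers are renumbered 0..k-1; all other layers are dropped.
    """
    student_index = {t: s for s, t in enumerate(teacher_layers)}
    student_state = {}
    for key, value in teacher_state.items():
        match = _LAYER_KEY.match(key)
        if match is None:
            student_state[key] = value
            continue
        prefix, layer, suffix = match.group(1), int(match.group(2)), match.group(3)
        if layer in student_index:
            student_state[f"{prefix}{student_index[layer]}{suffix}"] = value
    return student_state

if __name__ == "__main__":
    # Toy "weights": strings stand in for tensors.
    teacher = {"embeddings.word": "E"}
    teacher.update({f"encoder.layer.{i}.w": f"W{i}" for i in range(12)})
    lower = remap_state_dict(teacher, select_teacher_layers(12, 6, "lower"))
    higher = remap_state_dict(teacher, select_teacher_layers(12, 6, "higher"))
    print(lower["encoder.layer.0.w"], higher["encoder.layer.0.w"])
```

With real models, the values would be weight tensors rather than strings, and the resulting dict would be loaded into a student with the matching number of layers.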

Abstract

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves…
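
As a rough illustration of what an attention transfer objective can look like, the sketch below computes the KL divergence between teacher and student attention distributions, averaged over heads and query positions. This is one common formulation and an assumption here: the abstract does not spell out the exact loss, and the function names and the nested-list `[head][query][key]` layout are hypothetical. It also assumes the teacher and student layers have the same number of heads and the same sequence length.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def attention_transfer_loss(teacher_scores, student_scores):
    """Mean KL(teacher || student) over all heads and query positions.

    Both arguments are nested lists indexed as [head][query][key] and hold
    raw (pre-softmax) attention scores for one layer of one input sequence.
    """
    total, rows = 0.0, 0
    for t_head, s_head in zip(teacher_scores, student_scores):
        for t_row, s_row in zip(t_head, s_head):
            total += kl_divergence(softmax(t_row), softmax(s_row))
            rows += 1
    return total / rows

if __name__ == "__main__":
    # One head, two query positions, two key positions.
    teacher = [[[2.0, 0.0], [0.0, 2.0]]]
    student = [[[0.0, 0.0], [0.0, 0.0]]]
    print(attention_transfer_loss(teacher, teacher))  # identical maps
    print(attention_transfer_loss(teacher, student))  # mismatched maps
```

The loss is zero when the student reproduces the teacher's attention maps and grows as they diverge. In a training loop this would be computed on framework tensors over a batch rather than on Python lists.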

Code & Models

Repositories

mainlp/how-to-distill-your-bert (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · Softmax · Layer Normalization · Dropout · Linear Layer · Attention Dropout