Why Can You Lay Off Heads? Investigating How BERT Heads Transfer
Ting-Rui Chiang, Yun-Nung Chen

TL;DR
This paper investigates the transferability and importance of BERT model heads during distillation, analyzing how pruning affects performance on pre-training and downstream tasks to guide future model compression efforts.
Contribution
It provides a detailed analysis of head importance and transfer coherence in BERT models, offering insights into effective distillation and pruning strategies.
Findings
Prunability of Transformer heads varies across models.
Importance coherence between pre-training and downstream tasks is limited.
Pruned models retain performance after fine-tuning, guiding distillation.
Abstract
The huge size of the widely used BERT-family models has prompted recent work on model distillation. The main goal of distillation is to create a task-agnostic pre-trained model that can be fine-tuned on downstream tasks without fine-tuning its full-sized version. Despite this progress, to what degree and for what reason a task-agnostic model can be created through distillation has not been well studied, nor have the mechanisms behind transfer learning in these BERT models. This work therefore analyzes how much reduction is acceptable during distillation, in order to guide future distillation procedures. Specifically, we first inspect the prunability of the Transformer heads in RoBERTa and ALBERT using the head importance estimation proposed by Michel et al. (2019), and then check the coherence of the important heads between the pre-trained…
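The head importance estimation of Michel et al. (2019) referenced in the abstract scores each attention head by the expected absolute gradient of the loss with respect to a mask variable that scales that head's output. The following is a minimal NumPy sketch of that idea on toy arrays, not the authors' code: the function names `head_importance` and `heads_to_prune`, the array shapes, and the random data are all assumptions made for illustration, and the gradients are supplied directly rather than obtained by backpropagation through a real model.

```python
import numpy as np


def head_importance(head_outputs, grad_wrt_outputs):
    """Estimate per-head importance as the mean absolute mask gradient.

    head_outputs:     array of shape (num_examples, num_heads, dim),
                      the output of each head for each example.
    grad_wrt_outputs: array of the same shape, dLoss/d(head output).

    If a mask xi_h multiplies head h's output, the chain rule gives
    dLoss/dxi_h = <head output, dLoss/d(head output)> at xi_h = 1.
    (Hypothetical helper for illustration only.)
    """
    # Inner product over the feature dimension: one mask gradient
    # per example and head, shape (num_examples, num_heads).
    mask_grads = np.sum(head_outputs * grad_wrt_outputs, axis=-1)
    # Average the absolute value over examples, shape (num_heads,).
    return np.abs(mask_grads).mean(axis=0)


def heads_to_prune(importance, num_to_prune):
    """Return the indices of the least important heads (hypothetical helper)."""
    return np.argsort(importance)[:num_to_prune]


# Toy data: 8 examples, 4 heads, 16-dimensional head outputs.
rng = np.random.default_rng(0)
outs = rng.normal(size=(8, 4, 16))
grads = rng.normal(size=(8, 4, 16))
grads[:, 2, :] = 0.0  # pretend head 2 has no influence on the loss

scores = head_importance(outs, grads)
pruned = heads_to_prune(scores, 1)
```

In this toy setup the head whose gradient was zeroed receives an importance of zero and is the first candidate for pruning. With a real model, the two input arrays would come from a forward and backward pass over a sample of data.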
Taxonomy
Topics: Automotive and Human Injury Biomechanics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · LAMB
