TL;DR
This paper introduces a block pruning method for Transformers that reduces model size and increases inference speed by learning to prune entire components such as attention heads. On SQuAD v1 it yields a BERT that is 2.4x faster and 74% smaller with a 1% drop in F1.
Contribution
It extends structured pruning to consider blocks of any size and integrates this into movement pruning for fine-tuning, enabling more effective model compression and acceleration.
Findings
Achieved a 2.4x faster, 74% smaller BERT on SQuAD v1
That pruned model incurs a 1% drop in F1
Competitive with distilled models in speed and with pruned models in size
Abstract
Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
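The block structure described in the abstract can be illustrated with a small sketch. It assumes one importance score per block of a weight matrix and a simple top-k rule for choosing which blocks to keep. The function names `block_mask` and `apply_mask`, the `keep_fraction` parameter, and the top-k rule itself are illustrative assumptions and are not taken from the paper. In movement pruning the scores would be learned during fine-tuning; here they are fixed numbers.

```python
def block_mask(block_scores, block_shape, keep_fraction):
    """Expand per-block scores into a 0/1 mask over the full weight matrix.

    block_scores: 2D list with one score per block (hypothetical values here;
                  in movement pruning they would be learned during fine-tuning).
    block_shape:  (rows, cols) covered by each block.
    keep_fraction: fraction of blocks to keep (assumed top-k selection rule).
    """
    block_rows, block_cols = block_shape
    flat = sorted((s for row in block_scores for s in row), reverse=True)
    n_keep = max(1, round(keep_fraction * len(flat)))
    threshold = flat[n_keep - 1]  # tied scores may keep slightly more than n_keep blocks

    mask = []
    for score_row in block_scores:
        expanded = []
        for s in score_row:
            # Every weight inside a block shares that block's keep/prune decision.
            expanded.extend([1 if s >= threshold else 0] * block_cols)
        for _ in range(block_rows):
            mask.append(list(expanded))
    return mask


def apply_mask(weights, mask):
    """Zero out the pruned blocks of a weight matrix given as nested lists."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


# A 4x4 weight matrix split into four 2x2 blocks; keep the top half of the blocks.
scores = [[0.9, -0.3],
          [0.1, 0.7]]
mask = block_mask(scores, block_shape=(2, 2), keep_fraction=0.5)
weights = [[1.0] * 4 for _ in range(4)]
pruned = apply_mask(weights, mask)
```

With a block shape that spans the full slice of a weight matrix belonging to one attention head, a single pruned block in this sketch removes that head entirely, which matches the kind of outcome the abstract reports.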
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Linear Warmup With Linear Decay · Softmax · Attention Dropout · Dense Connections · Dropout
