Block Pruning For Faster Transformers

François Lagunas; Ella Charlaix; Victor Sanh; Alexander M. Rush

arXiv:2109.04838·cs.LG·September 13, 2021

TL;DR

This paper introduces a block pruning method for Transformers that targets models that are both small and fast. Fine-tuning with block-structured movement pruning learns to remove entire components such as attention heads, yielding a BERT on SQuAD v1 that is 2.4x faster and 74% smaller with a 1% drop in F1.

Contribution

The paper extends structured pruning to blocks of any size and integrates this block structure into the movement pruning paradigm used during fine-tuning, targeting models that are both smaller and faster.
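To make the idea of block-structured masking concrete, here is a minimal sketch in plain Python. It assumes each block of a weight matrix has a single importance score and that the highest-scoring blocks are kept. In movement pruning these scores would be learned during fine-tuning; here they are fixed inputs. The names `block_mask` and `apply_mask` and the top-k selection rule are illustrative assumptions, not the API of huggingface/nn_pruning or the authors' exact procedure.

```python
def block_mask(block_scores, block_shape, keep_fraction):
    """Expand per-block scores into a 0/1 mask over the full weight matrix.

    block_scores: 2D list with one score per block (hypothetical input; in
                  movement pruning these would be learned during fine-tuning).
    block_shape:  (block_height, block_width) in weight-matrix entries.
    keep_fraction: fraction of blocks to keep, chosen by highest score.
    """
    bh, bw = block_shape
    n_block_rows = len(block_scores)
    n_block_cols = len(block_scores[0])

    # Rank all blocks by score, highest first.
    ranked = sorted(
        ((s, i, j) for i, row in enumerate(block_scores) for j, s in enumerate(row)),
        key=lambda t: t[0],
        reverse=True,
    )
    n_keep = max(1, round(keep_fraction * len(ranked)))
    kept = {(i, j) for _, i, j in ranked[:n_keep]}

    # Every entry inherits the keep/prune decision of the block it belongs to.
    return [
        [1 if (r // bh, c // bw) in kept else 0 for c in range(n_block_cols * bw)]
        for r in range(n_block_rows * bh)
    ]


def apply_mask(weight, mask):
    """Zero out the pruned entries of a weight matrix (element-wise product)."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weight, mask)]


# Toy example: a 4x4 weight matrix split into four 2x2 blocks, keeping half of them.
scores = [[0.9, -0.2],
          [0.1, 0.5]]
mask = block_mask(scores, block_shape=(2, 2), keep_fraction=0.5)
weight = [[1.0] * 4 for _ in range(4)]
pruned = apply_mask(weight, mask)
```

Because whole blocks are zeroed together, the surviving weights stay in contiguous tiles, and when every block of a component is removed the whole component (for example an attention head) can be dropped from the computation.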

Findings

01 A 2.4x faster, 74% smaller BERT on SQuAD v1

02 The pruned model shows a 1% drop in F1 relative to the original

03 Competitive with distilled models in speed and with pruned models in size

Abstract

Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.

Code & Models

Repositories

huggingface/nn_pruning (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Linear Warmup With Linear Decay · Softmax · Attention Dropout · Dense Connections · Dropout