Prune Once for All: Sparse Pre-Trained Language Models
Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat

TL;DR
This paper introduces a method to create sparse, pre-trained Transformer language models through combined pruning and distillation, enabling efficient transfer learning with minimal accuracy loss and further compression via quantization.
Contribution
The paper presents a novel approach for training sparse pre-trained Transformer models that remain transferable to downstream tasks and achieve high compression ratios with minimal accuracy loss.
Findings
Achieved up to 40x compression with less than 1% accuracy loss.
Created sparse pre-trained BERT and DistilBERT models with state-of-the-art compression ratios.
Demonstrated effective transfer learning on five NLP tasks.
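One plausible way to arrive at a figure like 40x, offered here as an assumption because this summary does not give the breakdown, is 90% weight sparsity combined with quantizing the remaining weights from 32-bit to 8-bit. The sketch below works through that arithmetic. The function name `compression_ratio` and its parameters are hypothetical, and the overhead of storing sparse indices is ignored.

```python
def compression_ratio(sparsity: float, dense_bits: int = 32, quant_bits: int = 8) -> float:
    """Idealized ratio of dense storage to sparse, quantized storage.

    Assumption: only nonzero weights are stored and index overhead is ignored.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept_fraction = 1.0 - sparsity          # share of weights that survive pruning
    bit_reduction = dense_bits / quant_bits  # e.g. 32-bit floats to 8-bit integers
    return bit_reduction / kept_fraction


# Assumed setting: 90% sparsity and 8-bit weights.
ratio = compression_ratio(0.90)
print(f"{ratio:.1f}x")  # prints 40.0x
```

Under these assumptions the two factors multiply: 10x from pruning and 4x from quantization.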
Abstract
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy…
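The two ingredients named in the abstract, weight pruning and distillation, together with the idea of keeping the sparsity pattern fixed during transfer learning, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it uses one-shot magnitude pruning on a flat weight list, a plain KL-divergence distillation loss, and a masked SGD step, and every function name and hyperparameter here is hypothetical.

```python
import math


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step that keeps pruned positions at zero (pattern is preserved)."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy weights; prune half of them by magnitude.
weights = [0.5, -0.05, 0.3, 0.01, -0.8, 0.02, 0.1, -0.2, 0.04, 0.9]
mask = magnitude_mask(weights, sparsity=0.5)
pruned = [w * m for w, m in zip(weights, mask)]

# A fine-tuning step with arbitrary gradients leaves pruned weights at zero.
updated = masked_sgd_step(pruned, grads=[1.0] * len(weights), mask=mask)

# The dense teacher's logits guide the sparse student (toy values).
loss = distillation_loss([1.0, 0.5, -0.2], [1.2, 0.3, -0.5])
```

Multiplying each update by the mask is what lets a downstream fine-tuning run inherit the pre-trained sparsity pattern instead of re-densifying the weights.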
Code & Models
- 🤗 Intel/bert-base-uncased-sparse-85-unstructured-pruneofa · model · 10 dl
- 🤗 Intel/bert-base-uncased-sparse-90-unstructured-pruneofa · model · 21 dl
- 🤗 Intel/bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa · model · 22 dl
- 🤗 Intel/distilbert-base-uncased-sparse-85-unstructured-pruneofa · model · 10 dl
- 🤗 Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa · model · 19 dl · ♡ 2
- 🤗 Intel/bert-large-uncased-squadv1.1-sparse-80-1x4-block-pruneofa · model · 8 dl · ♡ 1
- 🤗 Intel/bert-large-uncased-sparse-80-1x4-block-pruneofa · model · 3 dl
- 🤗 Intel/bert-base-uncased-sparse-80-1x4-block-pruneofa · model · 2 dl
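Several of the model names above contain "1x4-block", in contrast to the "unstructured" ones. A reasonable reading, stated here as an assumption rather than taken from the paper, is that weights are pruned in groups of four consecutive values within a row instead of individually. The sketch below shows that idea on a single row; the function name `block_1x4_mask` and the block scoring rule (sum of absolute values) are hypothetical choices.

```python
def block_1x4_mask(row, sparsity, block=4):
    """Return a 0/1 mask that zeroes whole blocks of `block` consecutive weights.

    Blocks are ranked by the sum of absolute values (an assumed scoring rule),
    and the lowest-scoring fraction of blocks is pruned.
    """
    if len(row) % block != 0:
        raise ValueError("row length must be a multiple of the block size")
    n_blocks = len(row) // block
    scores = [
        sum(abs(w) for w in row[b * block:(b + 1) * block])
        for b in range(n_blocks)
    ]
    n_prune = int(n_blocks * sparsity)
    order = sorted(range(n_blocks), key=lambda b: scores[b])
    pruned_blocks = set(order[:n_prune])
    return [0 if (i // block) in pruned_blocks else 1 for i in range(len(row))]


# Toy row with five 1x4 blocks; at 80% sparsity only the strongest block survives.
row = [
    0.1, 0.2, 0.1, 0.1,
    0.9, -0.8, 0.7, 0.6,
    0.0, 0.05, 0.0, 0.01,
    0.3, 0.1, 0.2, 0.1,
    0.02, 0.03, 0.01, 0.04,
]
block_mask = block_1x4_mask(row, sparsity=0.8)
block_sparsity = 1 - sum(block_mask) / len(block_mask)
```

Block patterns like this trade some flexibility in which weights are removed for zeros that fall in regular, contiguous runs.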
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · WordPiece · Weight Decay · Absolute Position Encodings · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer
