Prune Once for All: Sparse Pre-Trained Language Models

Ofir Zafrir; Ariel Larey; Guy Boudoukh; Haihao Shen; Moshe Wasserblat

arXiv:2111.05754 · cs.CL · November 11, 2021 · 26 cites

TL;DR

This paper introduces a method to create sparse, pre-trained Transformer language models through combined pruning and distillation, enabling efficient transfer learning with minimal accuracy loss and further compression via quantization.

Contribution

It presents a novel approach for training sparse pre-trained Transformer models that maintain transferability and achieve high compression ratios with minimal accuracy loss.

Findings

01 Achieved up to 40x compression with less than 1% accuracy loss.

02 Created sparse pre-trained BERT and DistilBERT models with state-of-the-art compression ratios.

03 Demonstrated effective transfer learning on five NLP tasks.
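
To make the compression figure above concrete, here is a rough back-of-the-envelope sketch of how sparsity and quantization multiply. The 90% sparsity level and the 32-bit to 8-bit conversion are assumptions chosen for illustration and are not stated on this page; the helper `compression_ratio` is hypothetical, and it ignores the index overhead of any real sparse storage format.

```python
def compression_ratio(sparsity: float, dense_bits: int = 32, quant_bits: int = 8) -> float:
    """Idealized ratio of dense model size to sparse, quantized model size.

    Assumes only nonzero weights are stored and ignores sparse-format metadata.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept_fraction = 1.0 - sparsity          # share of weights that survive pruning
    bit_reduction = dense_bits / quant_bits  # e.g. 32-bit floats down to 8-bit integers
    return bit_reduction / kept_fraction


# Assumed setting: 90% of weights pruned, remaining weights stored in 8 bits.
ratio = compression_ratio(0.90)
print(f"idealized compression: about {ratio:.0f}x")
```

Under these assumptions the two factors (roughly 10x from pruning and 4x from quantization) combine to about 40x; a deployed model would see a somewhat smaller ratio once sparse indices are counted.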

Abstract

Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning across a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy…
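
The three ingredients named in the abstract (weight pruning, distillation, and fine-tuning that keeps the sparsity pattern) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: one-shot magnitude pruning is swapped in as a simple stand-in because this page does not spell out the pruning criterion or schedule, and the function names, the temperature value, and the toy weights are all hypothetical.

```python
import math


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask zeroing the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1.0] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0.0
    return mask


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step that re-applies the mask, so pruned weights stay at zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy weight vector standing in for one layer (assumed values).
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.5]
mask = magnitude_mask(weights, sparsity=0.5)
pruned = [w * m for w, m in zip(weights, mask)]

# A fine-tuning step with the mask fixed: the sparsity pattern is unchanged.
grads = [0.1] * len(weights)
updated = masked_sgd_step(pruned, grads, mask)

# The distillation term is zero when the student matches the teacher exactly.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Re-applying the fixed mask after each update is what lets a downstream task reuse the sparse model without densifying it; in a full training loop the distillation term would typically be added to the task loss.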

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · WordPiece · Weight Decay · Absolute Position Encodings · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer