The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh

arXiv:2203.07259 · cs.CL · October 19, 2022

TL;DR

This paper introduces oBERT, a second-order pruning method for BERT that achieves high compression and speedup with minimal accuracy loss, enabling efficient deployment on edge devices.

Contribution

The paper presents a novel second-order pruning technique for BERT, extending existing methods to prune blocks of weights and scale to large models, with state-of-the-art results.

Findings

1. 10x model size reduction with <1% accuracy loss
2. 10x inference speedup with <2% accuracy loss
3. 29x inference speedup with <7.5% accuracy loss

Abstract

Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed with low accuracy loss. In this context, this paper's contributions are two-fold. We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on…
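To make "pruning based on second-order information" concrete, here is a minimal sketch of the classic Optimal Brain Surgeon saliency score, rho_i = w_i^2 / (2 [H^-1]_ii), under a diagonal Hessian assumption (where it reduces to 0.5 * w_i^2 * h_ii). This is an illustration only, not the paper's algorithm: the function names `obs_saliency` and `prune_mask`, the diagonal approximation, and the toy numbers are all assumptions made for this example.

```python
def obs_saliency(weights, hessian_diag):
    """Estimated loss increase from zeroing each weight.

    Assumes a diagonal Hessian, so [H^-1]_ii = 1 / h_ii and the
    Optimal Brain Surgeon score w_i^2 / (2 [H^-1]_ii) becomes
    0.5 * w_i^2 * h_ii. (Hypothetical helper, not from the paper.)
    """
    return [0.5 * w * w * h for w, h in zip(weights, hessian_diag)]


def prune_mask(weights, hessian_diag, sparsity):
    """Return a 0/1 mask that removes the lowest-saliency weights.

    `sparsity` is the fraction of weights to remove (0.0 to 1.0).
    """
    scores = obs_saliency(weights, hessian_diag)
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from least to most important.
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


# Toy values chosen for illustration only.
weights = [0.5, -0.1, 2.0, 0.3]
hessian_diag = [1.0, 10.0, 0.01, 1.0]
mask = prune_mask(weights, hessian_diag, sparsity=0.5)
print(mask)
```

In this toy case the largest weight (2.0) is pruned because its curvature is very small, while the smallest weight (-0.1) is kept because the loss is sensitive to it. Magnitude pruning would have made the opposite choice for both, which is the motivation for using curvature information.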

Code & Models

Repositories

neuralmagic/sparseml (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Dense Connections · Attention Is All You Need · Residual Connection · Weight Decay · Layer Normalization · Linear Warmup With Linear Decay