TL;DR
This paper introduces oBERT, a second-order pruning method for BERT that achieves high compression and speedup with minimal accuracy loss, enabling efficient deployment on edge devices.
Contribution
The paper presents a novel second-order pruning technique for BERT, extending existing methods to prune blocks of weights and scale to large models, with state-of-the-art results.
Findings
10x model size reduction with <1% accuracy loss
10x inference speedup with <2% accuracy loss
29x inference speedup with <7.5% accuracy loss
Abstract
Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on…
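To make "second-order pruning" concrete, here is a minimal sketch of the classical Optimal Brain Surgeon (OBS) rule that this family of methods builds on. It is not oBERT itself: oBERT works with an approximate Hessian and prunes blocks of weights at BERT scale, whereas this toy version uses an exact 3×3 Hessian and removes a single weight. The function names `invert` and `obs_prune_one`, and the example weights and Hessian, are hypothetical and chosen only for illustration.

```python
"""Toy Optimal Brain Surgeon (OBS) step on a tiny quadratic model.

Assumption: the loss near the trained weights is approximately quadratic with
a known, invertible Hessian. Real methods only approximate this.
"""


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def obs_prune_one(weights, hessian):
    """Zero out the weight with the smallest OBS saliency.

    Saliency of weight q:   w_q^2 / (2 * [H^-1]_qq)
    Compensating update:    w <- w - (w_q / [H^-1]_qq) * H^-1[:, q]
    Returns (pruned_index, updated_weights).
    """
    h_inv = invert(hessian)
    saliency = [w * w / (2.0 * h_inv[i][i]) for i, w in enumerate(weights)]
    q = min(range(len(weights)), key=saliency.__getitem__)
    step = weights[q] / h_inv[q][q]
    updated = [w - step * h_inv[i][q] for i, w in enumerate(weights)]
    updated[q] = 0.0  # exact zero, guarding against floating-point residue
    return q, updated


# Hypothetical toy problem: three weights and a symmetric positive-definite Hessian.
weights = [1.0, 0.5, -2.0]
hessian = [
    [2.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 4.0],
]
index, pruned = obs_prune_one(weights, hessian)
print(index, pruned)
```

In this toy problem the second weight has the lowest saliency and is removed, and the first weight moves from 1.0 to 1.125 to compensate through the off-diagonal Hessian term. That compensating update is what separates second-order pruning from plain magnitude pruning, which would zero the weight and leave the others unchanged.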
Code & Models
- 🤗 RedHatAI/oBERT-teacher-squadv1 · model · 17 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-80-squadv1 · model · 10 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-90-squadv1 · model · 9 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-97-squadv1 · model · 14 downloads
- 🤗 RedHatAI/oBERT-teacher-mnli · model · 11 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-80-mnli · model · 9 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-90-mnli · model · 11 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-97-mnli · model · 7 downloads
- 🤗 RedHatAI/oBERT-teacher-qqp · model · 6 downloads
- 🤗 RedHatAI/oBERT-12-downstream-pruned-unstructured-80-qqp · model · 5 downloads
Taxonomy
Methods: Multi-Head Attention · Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Dense Connections · Attention Is All You Need · Residual Connection · Weight Decay · Layer Normalization · Linear Warmup With Linear Decay
