Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding

Connor Holmes; Minjia Zhang; Yuxiong He; Bo Wu

arXiv:2206.15014 · cs.CL · July 1, 2022

TL;DR

This paper introduces NxMiFormer, a flexible compression framework for pre-trained Transformers that combines sparsification and quantization, achieving high compression ratios with minimal accuracy loss on NLP benchmarks.

Contribution

It systematically investigates the benefits of N:M sparsity and low-bit quantization for Transformer compression and proposes a heuristic search for optimal configurations.

Findings

01. Achieves up to 93% encoder compression with 98.2% accuracy retention.

02. Heterogeneous configurations maintain 99.5% accuracy with 87.5% compression.

03. Effectively leverages hardware support for sparsity and low-precision computation.
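
To give a feel for how N:M sparsity and low-bit quantization multiply into a storage reduction, here is a rough arithmetic sketch. It assumes a 32-bit floating-point baseline and ignores the index metadata needed to locate the kept weights; the function name `compression_ratio` is hypothetical, and this page does not confirm that the percentages above were computed this way or with these settings.

```python
def compression_ratio(n: int, m: int, bits: int, baseline_bits: int = 32) -> float:
    """Fraction of weight storage removed when n of every m weights are kept
    and each kept weight is stored in `bits` bits.

    Assumption: index metadata for the sparse layout is not counted.
    """
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    kept_fraction = (n / m) * (bits / baseline_bits)
    return 1.0 - kept_fraction


# Illustrative setting only: 2:4 sparsity with 8-bit weights
print(compression_ratio(2, 4, 8))  # prints 0.875
```

Under this simplified accounting, halving the number of weights and quartering their bit width leaves one eighth of the original storage.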

Abstract

In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks. However, the huge size of these models brings significant challenges to their fine-tuning and online deployment due to latency and cost constraints. New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost DNN model serving efficiency. However, there have been very few studies that systematically investigate to what extent pre-trained Transformer networks benefit from the combination of these techniques, as well as how to best compress each component of the Transformer. We propose a flexible compression framework NxMiFormer that performs simultaneous sparsification and quantization using ADMM and STE-based QAT. Furthermore, we present an inexpensive, heuristic-driven search…
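
As a minimal illustration of the two ingredients named in the abstract, the NumPy sketch below applies an N:M magnitude mask (keep the N largest-magnitude weights in every group of M) followed by symmetric uniform quantization. It is a one-shot projection only: it does not implement the ADMM training or STE-based QAT that the paper uses, and the helper names `nm_prune` and `fake_quantize`, the per-tensor scale, and the example weights are all assumptions made for illustration.

```python
import numpy as np


def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude entries in each consecutive group of m
    along the last axis and zero the rest (hypothetical helper)."""
    if weights.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(weights.shape)


def fake_quantize(weights: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor uniform quantization, returned as floats
    (assumed scheme, not necessarily the paper's)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(weights))
    if max_abs == 0:
        return weights.copy()
    scale = max_abs / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale


w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]])
sparse = nm_prune(w, n=2, m=4)            # two nonzeros per group of four
compressed = fake_quantize(sparse, bits=8)
print(sparse)
print(compressed)
```

Because zero maps exactly to zero under symmetric quantization, the sparsity pattern produced by the mask survives the quantization step.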

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Weight Decay · WordPiece · Transformer · Softmax