Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding
Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu

TL;DR
This paper introduces NxMiFormer, a flexible compression framework for pre-trained Transformers that combines sparsification and quantization, achieving high compression ratios with minimal accuracy loss on NLP benchmarks.
Contribution
It systematically investigates the benefits of N:M sparsity and low-bit quantization for Transformer compression and proposes a heuristic search for optimal configurations.
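As an illustration of the N:M pattern mentioned above, the following is a minimal NumPy sketch of magnitude-based N:M pruning. It assumes the common convention that N:M sparsity keeps N nonzero values in every group of M consecutive weights (for example 2:4). The function name `nm_prune` and the magnitude criterion are assumptions made for this example; they are not taken from the paper, which uses ADMM rather than one-shot magnitude pruning.

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m
    consecutive weights along the last axis; zero the rest.
    Hypothetical helper for illustration only."""
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = weights.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols), mask.reshape(rows, cols)

w = np.array([[0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]])
pruned, mask = nm_prune(w, n=2, m=4)
print(pruned)
```

Each group of four weights ends up with exactly two nonzero entries, which is the regular structure that sparsity-aware hardware can exploit.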
Findings
Achieves up to 93% encoder compression with 98.2% accuracy retention.
Heterogeneous configurations maintain 99.5% accuracy with 87.5% compression.
Effectively leverages hardware support for sparsity and low-precision computation.
Abstract
In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks. However, the huge size of these models brings significant challenges to their fine-tuning and online deployment due to latency and cost constraints. New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost DNN model serving efficiency. However, there have been very few studies that systematically investigate to what extent pre-trained Transformer networks benefit from the combination of these techniques, as well as how to best compress each component of the Transformer. We propose a flexible compression framework, NxMiFormer, that performs simultaneous sparsification and quantization using ADMM and STE-based QAT. Furthermore, we present an inexpensive, heuristic-driven search…
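To make the quantization side of the abstract more concrete, here is a minimal NumPy sketch of symmetric uniform "fake" quantization, the forward operation typically used in quantization-aware training (QAT), together with a straight-through estimator (STE) for the backward pass. The names `fake_quantize` and `ste_backward`, the per-tensor scale, and the 4-bit setting are assumptions chosen for illustration; the paper's actual quantizer and its ADMM coupling may differ.

```python
import numpy as np

def fake_quantize(x, bits=8):
    """Quantize x to signed integers with `bits` bits, then map back
    to floats. Uses one symmetric per-tensor scale (an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale, scale

def ste_backward(upstream_grad):
    """Straight-through estimator: treat round() as the identity in
    the backward pass, so the gradient passes through unchanged."""
    return upstream_grad

x = np.array([-1.0, -0.3, 0.02, 0.45, 0.8])
x_q, scale = fake_quantize(x, bits=4)
grad = ste_backward(np.ones_like(x))
print(x_q, scale)
```

With 4 bits the tensor is restricted to at most 15 distinct levels, and the rounding error of each element is bounded by half the scale. Since rounding has zero gradient almost everywhere, the STE is what lets training continue through the quantizer.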
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Weight Decay · WordPiece · Transformer · Softmax
