Understanding and Overcoming the Challenges of Efficient Transformer Quantization
Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

TL;DR
This paper investigates the challenges of quantizing transformer models, proposes novel solutions including per-embedding-group quantization, and demonstrates state-of-the-art results on the GLUE benchmark with significant memory savings.
Contribution
The paper introduces quantization techniques tailored for transformers that address activation outliers and their structured patterns, improving post-training quantization performance.
Findings
State-of-the-art post-training quantization results on GLUE with BERT.
Effective ultra-low bit-width quantization with minimal accuracy loss.
Novel per-embedding-group quantization scheme enhances model compression.
Abstract
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we…
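The abstract's central difficulty, a few outlier dimensions stretching the activation range so that a low-bit fixed-point grid becomes coarse for everything else, can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes that per-embedding-group quantization means splitting the embedding dimension into contiguous groups, each with its own quantization range, and it uses synthetic Gaussian activations with two artificially shifted dimensions. All function names (fake_quantize, per_tensor, per_embedding_group) and all numbers are hypothetical choices made for illustration.

```python
import random

def fake_quantize(values, num_bits=8):
    """Quantize then dequantize a flat list using one shared min/max range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** num_bits - 1)
    if scale == 0.0:  # constant input: nothing to quantize
        return list(values)
    return [lo + round((v - lo) / scale) * scale for v in values]

def per_tensor(acts, num_bits=8):
    """One range for the whole (tokens x embedding) activation matrix."""
    n_tok, dim = len(acts), len(acts[0])
    flat = fake_quantize([v for row in acts for v in row], num_bits)
    return [flat[t * dim:(t + 1) * dim] for t in range(n_tok)]

def per_embedding_group(acts, num_groups, num_bits=8):
    """One range per contiguous group of embedding dimensions (assumed scheme)."""
    n_tok, dim = len(acts), len(acts[0])
    size = dim // num_groups  # assumes dim is divisible by num_groups
    out = [[0.0] * dim for _ in range(n_tok)]
    for g in range(num_groups):
        cols = range(g * size, (g + 1) * size)
        flat = fake_quantize(
            [acts[t][c] for t in range(n_tok) for c in cols], num_bits
        )
        it = iter(flat)
        for t in range(n_tok):
            for c in cols:
                out[t][c] = next(it)
    return out

def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Synthetic activations: unit Gaussians, with outliers confined to two dimensions.
random.seed(0)
n_tok, dim = 64, 16
acts = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tok)]
for row in acts:
    row[0] += 50.0
    row[1] -= 40.0

err_tensor = mse(acts, per_tensor(acts))
err_group = mse(acts, per_embedding_group(acts, num_groups=4))
print(f"per-tensor MSE:          {err_tensor:.6f}")
print(f"per-embedding-group MSE: {err_group:.6f}")
```

In this toy setup only the group that contains the outlier dimensions keeps the wide range, while the remaining groups get a much finer grid, so the grouped scheme should report a lower reconstruction error. The cost is storing one set of quantization parameters per group rather than one per tensor.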
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Softmax · Dropout
