Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Yelysei Bondarenko; Markus Nagel; Tijmen Blankevoort

arXiv:2109.12948·cs.LG·September 28, 2021

TL;DR

This paper investigates the challenges of quantizing transformer models, proposes novel solutions including per-embedding-group quantization, and demonstrates state-of-the-art results on the GLUE benchmark with significant memory savings.

Contribution

The paper introduces quantization techniques tailored to transformers that address the structured activation outliers in the residual connections, and it improves post-training quantization performance.

Findings

1. State-of-the-art post-training quantization results on GLUE with BERT.

2. Effective ultra-low bit-width quantization with minimal accuracy loss.

3. Novel per-embedding-group quantization scheme enhances model compression.

Abstract

Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we…
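The dynamic-range problem described in the abstract can be illustrated with a small sketch. It is not taken from the paper's repository: the synthetic activations, the outlier positions and magnitudes, and all function names are assumptions made for illustration, and sorting dimensions by range before grouping is a choice of this sketch. It compares a single quantization range for the whole activation matrix against one range per group of embedding dimensions, which is the idea behind the per-embedding-group scheme named in the TL;DR.

```python
import random


def quantize_dequantize(values, num_bits=8):
    """Simulate asymmetric uniform quantization: map to an integer grid and back."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    # Guard against a zero range (all values identical).
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]


def mean_squared_error(original, restored):
    """Average squared difference over all entries of two token-by-dimension matrices."""
    diffs = [(x - y) ** 2 for r0, r1 in zip(original, restored) for x, y in zip(r0, r1)]
    return sum(diffs) / len(diffs)


def per_tensor(acts, num_bits=8):
    """One quantization range shared by the whole activation matrix."""
    dim = len(acts[0])
    flat = quantize_dequantize([v for row in acts for v in row], num_bits)
    return [flat[i * dim:(i + 1) * dim] for i in range(len(acts))]


def per_embedding_group(acts, num_groups, num_bits=8):
    """One quantization range per group of embedding dimensions."""
    dim = len(acts[0])
    if dim % num_groups != 0:
        raise ValueError("embedding size must be divisible by num_groups")
    # Assumption of this sketch: order dimensions by dynamic range so that
    # outlier dimensions fall into the same group.
    spans = [max(r[j] for r in acts) - min(r[j] for r in acts) for j in range(dim)]
    order = sorted(range(dim), key=lambda j: spans[j])
    size = dim // num_groups
    out = [[0.0] * dim for _ in acts]
    for g in range(num_groups):
        dims = order[g * size:(g + 1) * size]
        flat = quantize_dequantize([r[j] for r in acts for j in dims], num_bits)
        restored = iter(flat)
        for row in out:
            for j in dims:
                row[j] = next(restored)
    return out


def make_activations(tokens=16, dim=32, seed=0):
    """Synthetic data: unit Gaussians plus large values in two dimensions at one token."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(tokens)]
    acts[0][5] = 50.0    # hypothetical outlier positions and magnitudes
    acts[0][20] = -40.0
    return acts


acts = make_activations()
err_tensor = mean_squared_error(acts, per_tensor(acts))
err_group = mean_squared_error(acts, per_embedding_group(acts, num_groups=4))
print(f"per-tensor MSE: {err_tensor:.6f}")
print(f"per-group  MSE: {err_group:.6f}")
```

With a shared range, two outlier entries stretch the quantization grid for every value in the matrix. With grouping, only the group that contains the outlier dimensions pays that cost, at the price of storing one set of quantization parameters per group.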

Code & Models

Repositories

qualcomm-ai-research/transformer-quantization
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Softmax · Dropout