ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Zhewei Yao; Reza Yazdani Aminabadi; Minjia Zhang; Xiaoxia Wu; Conglong Li; Yuxiong He

arXiv:2206.01861 · cs.CL · June 7, 2022 · 72 cites

TL;DR

ZeroQuant is a novel post-training quantization method that efficiently compresses large Transformer models to INT8 and INT4 precision, significantly reducing memory and computation costs while maintaining accuracy.

Contribution

The paper introduces ZeroQuant, an end-to-end quantization pipeline with a hardware-friendly scheme, a layer-wise knowledge distillation method without training data, and optimized backend support.

Findings

01

ZeroQuant reduces weights and activations to INT8 with minimal accuracy loss.

02

ZeroQuant achieves up to 5.19x/4.16x speedup on BERT/GPT-3-style models compared to FP16 inference.

03

ZeroQuant applies directly to large open-source models such as GPT-J 6B and GPT-NeoX 20B, where the INT8 model reaches accuracy similar to FP16 with up to 5.2x better efficiency.

Abstract

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained hardware-friendly quantization scheme for both weight and activations; (2) a novel affordable layer-by-layer knowledge distillation algorithm (LKD) even without the access to the original training data; (3) a highly-optimized quantization system backend support to remove the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision for weights and activations to INT8 in…
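
As an illustration of component (1), the following is a minimal NumPy sketch of fine-grained symmetric INT8 quantization: one scale per group of weight columns, and one scale per token for activations. It is not the authors' implementation. The function names, the grouping layout (contiguous column groups within each row), and the toy shapes are assumptions chosen for readability, and the optimized kernels of component (3) are not modeled.

```python
import numpy as np


def quantize_symmetric(x, axis, num_bits=8):
    """Symmetric uniform quantization with one scale per slice along `axis`."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    # Guard against all-zero slices so we never divide by zero.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def quantize_weight_groupwise(w, group_size):
    """Hypothetical grouping: split each row into contiguous column groups."""
    rows, cols = w.shape
    assert cols % group_size == 0, "cols must be divisible by group_size"
    grouped = w.reshape(rows, cols // group_size, group_size)
    q, scale = quantize_symmetric(grouped, axis=-1)
    return q.reshape(rows, cols), scale  # scale: (rows, n_groups, 1)


def dequantize_weight_groupwise(q, scale, group_size):
    """Map INT8 codes back to floats using the per-group scales."""
    rows, cols = q.shape
    grouped = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (grouped * scale).reshape(rows, cols)


def quantize_activation_tokenwise(x):
    """x has shape (tokens, hidden); each token row gets its own scale."""
    return quantize_symmetric(x, axis=-1)


# Toy data (random values, for illustration only).
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # (out, in)
x = rng.standard_normal((3, 8)).astype(np.float32)  # (tokens, hidden)

q_w, s_w = quantize_weight_groupwise(w, group_size=4)
w_hat = dequantize_weight_groupwise(q_w, s_w, group_size=4)
q_x, s_x = quantize_activation_tokenwise(x)
x_hat = q_x.astype(np.float32) * s_x

# Rounding error per element is bounded by half of that element's scale.
max_w_err = float(np.max(np.abs(w - w_hat)))
max_x_err = float(np.max(np.abs(x - x_hat)))
```

Smaller groups give each scale fewer values to cover, which tends to reduce rounding error at the cost of storing more scales. Per-token activation scales have to be computed at inference time, which is one reason a dedicated backend matters for removing quantization/dequantization overhead.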

Code & Models

Videos

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Softmax · Layer Normalization · Attention Dropout · WordPiece · Adam