ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

TL;DR
ZeroQuant is a novel post-training quantization method that efficiently compresses large Transformer models to INT8 and INT4 precision, significantly reducing memory and computation costs while maintaining accuracy.
Contribution
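The paper introduces ZeroQuant, an end-to-end quantization and inference pipeline with three components: a fine-grained, hardware-friendly quantization scheme for both weights and activations; a layer-by-layer knowledge distillation algorithm (LKD) that does not require access to the original training data; and an optimized system backend that removes quantization/dequantization overhead.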
Findings
ZeroQuant reduces weights and activations to INT8 with minimal accuracy loss.
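The INT8 models achieve up to 5.19x/4.16x speedup on BERT and GPT-3-style models compared to FP16 inference.
ZeroQuant applies directly to large open-source models such as GPT-J6B and GPT-NeoX20B, where the INT8 model keeps similar accuracy with up to 5.2x better efficiency.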
Abstract
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained hardware-friendly quantization scheme for both weight and activations; (2) a novel affordable layer-by-layer knowledge distillation algorithm (LKD) even without the access to the original training data; (3) a highly-optimized quantization system backend support to remove the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision for weights and activations to INT8 in…
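The "fine-grained" quantization scheme named in the abstract can be illustrated with a minimal sketch. It assumes symmetric quantization with one scale per row (treating each row as one weight group or one token) and compares it with a single scale for the whole tensor. The function names and the example values are hypothetical and are not taken from the ZeroQuant code base; the sketch only simulates quantization in floating point and does not reflect the paper's optimized kernels.

```python
from typing import List, Sequence


def compute_scale(values: Sequence[float], bits: int = 8) -> float:
    """Symmetric scale mapping the largest magnitude onto the top integer code."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to 1.0 so an all-zero group does not divide by zero.
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values: Sequence[float], scale: float, bits: int = 8) -> List[float]:
    """Round to integer codes, clamp, then map back to floats (simulated quantization)."""
    qmax = 2 ** (bits - 1) - 1
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]


def per_tensor(matrix: Sequence[Sequence[float]], bits: int = 8) -> List[List[float]]:
    """Coarse scheme: one scale shared by every element of the matrix."""
    scale = compute_scale([v for row in matrix for v in row], bits)
    return [fake_quantize(row, scale, bits) for row in matrix]


def per_row(matrix: Sequence[Sequence[float]], bits: int = 8) -> List[List[float]]:
    """Fine-grained scheme (assumed here): one scale per row."""
    return [fake_quantize(row, compute_scale(row, bits), bits) for row in matrix]


def max_abs_error(original: Sequence[float], approx: Sequence[float]) -> float:
    """Largest absolute difference between a row and its quantized version."""
    return max(abs(a - b) for a, b in zip(original, approx))


# Hypothetical activations: two "tokens" whose value ranges differ by 1000x.
activations = [[0.01, -0.02, 0.03], [10.0, -20.0, 30.0]]
coarse = per_tensor(activations)
fine = per_row(activations)
```

With a shared scale, the small-range row is rounded entirely to zero because the large-range row dictates the step size; with one scale per row, both rows keep their relative precision. This is the motivation for finer granularity, at the cost of storing and applying more scales.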
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Softmax · Layer Normalization · Attention Dropout · WordPiece · Adam
