ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers
Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He

TL;DR
ZeroQuant-HERO introduces a hardware-aware post-training quantization framework for W8A8 transformers, optimizing memory and compute efficiency while allowing mode flexibility to improve accuracy.
Contribution
It is the first to integrate hardware considerations into robust post-training quantization for transformers, addressing memory and operator complexities.
Findings
Enhanced hardware performance in W8A8 quantized transformers
Flexible mode switching improves quantization accuracy
Addresses memory-bounded operators and per-token quantization challenges
Abstract
Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO. This framework uniquely accounts for both memory-bandwidth-bound and compute-intensive operators, aiming for optimal hardware performance. Additionally, it offers flexibility by allowing specific INT8 modules to switch to FP16/BF16 mode, enhancing accuracy.
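To make the terms in the abstract concrete, the following is a minimal sketch of a W8A8 linear layer with per-token dynamic activation quantization, the scheme the abstract associates with ZeroQuant. It is not the ZeroQuant-HERO implementation: the function names `quantize_rows` and `w8a8_linear` are hypothetical, symmetric scaling to the range [-127, 127] is an assumption, and plain Python lists stand in for tensors.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid division by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w):
    """Compute x @ w.T using INT8 operands and an integer accumulator.

    x: tokens x in_features (activations)
    w: out_features x in_features (weights)
    """
    xq, xs = quantize_rows(x)  # one scale per token, computed at run time
    wq, ws = quantize_rows(w)  # one scale per output channel
    out = []
    for i, x_row in enumerate(xq):
        out_row = []
        for j, w_row in enumerate(wq):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulation
            out_row.append(acc * xs[i] * ws[j])  # dequantize with both scales
        out.append(out_row)
    return out
```

Because each token's scale is derived from that token's own values at inference time, the extra reduction pass over the activations is a memory-bound step, which is one reason per-token quantization adds cost on real hardware.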
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and ELM · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay · WordPiece · Softmax · Dense Connections
