ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

Zhewei Yao; Reza Yazdani Aminabadi; Stephen Youn; Xiaoxia Wu; Elton Zheng; Yuxiong He

arXiv:2310.17723·cs.LG·October 30, 2023·1 cites

Open Access

TL;DR

ZeroQuant-HERO introduces a hardware-aware post-training quantization framework for W8A8 transformers, optimizing memory and compute efficiency while allowing mode flexibility to improve accuracy.

Contribution

ZeroQuant-HERO is described as the first framework to integrate hardware considerations into robust post-training quantization for transformers, addressing memory-bound operators and the complexity of per-token quantization.

Findings

1. Enhanced hardware performance in W8A8 quantized transformers

2. Flexible mode switching improves quantization accuracy

3. Addresses memory-bounded operators and per-token quantization challenges

Abstract

Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and compute-intensive operators, aiming for optimal hardware performance. Additionally, it offers flexibility by allowing specific INT8 modules to switch to FP16/BF16 mode, enhancing accuracy.
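To make the terms in the abstract concrete, here is a minimal sketch of a simulated W8A8 linear layer with per-token activation scales and an optional higher-precision fallback. It is an illustration under stated assumptions, not the paper's implementation: the names `quantize_rows_int8` and `linear_w8a8` are hypothetical, the per-output-channel weight scale is an assumed granularity that the abstract does not specify, and ordinary Python floats stand in for FP16/BF16.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows_int8(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per row.

    For activations a row is one token (per-token quantization).
    For weights laid out as [out_features][in_features] a row is one
    output channel (an assumed granularity, chosen for illustration).
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in x:
        max_abs = max((abs(v) for v in row), default=0.0)
        # Guard against all-zero rows so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def linear_w8a8(x: Matrix, w: Matrix, use_int8: bool = True) -> Matrix:
    """Compute x @ w^T, either in simulated W8A8 or in floating point."""
    if not use_int8:
        # Hypothetical fallback path: stands in for running the module
        # in FP16/BF16 instead of INT8.
        return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]

    xq, xs = quantize_rows_int8(x)  # one scale per token
    wq, ws = quantize_rows_int8(w)  # one scale per output channel
    out: Matrix = []
    for xr, sx in zip(xq, xs):
        # Accumulate in integers, then rescale by both scales.
        out.append(
            [sum(a * b for a, b in zip(xr, wr)) * sx * sw for wr, sw in zip(wq, ws)]
        )
    return out


if __name__ == "__main__":
    x = [[1.0, -2.0, 0.5], [0.1, 0.2, -0.3]]   # 2 tokens, 3 features
    w = [[0.5, 0.25, -1.0], [2.0, 0.0, 1.0]]   # 2 output channels
    print(linear_w8a8(x, w, use_int8=True))
    print(linear_w8a8(x, w, use_int8=False))
```

Comparing the two printed results shows the rounding error that INT8 introduces. The `use_int8` flag mirrors, in a very simplified way, the idea of switching an individual module to a higher-precision mode when that error is too large.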

Taxonomy

Topics: Advanced Neural Network Applications · Machine Learning and ELM · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay · WordPiece · Softmax · Dense Connections