Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius

TL;DR
This paper reviews mathematical principles of integer quantization for deep learning, evaluates various techniques across multiple models, and presents a workflow that maintains high accuracy with 8-bit quantization for diverse neural networks.
Contribution
It provides a comprehensive analysis of quantization parameters and introduces an effective 8-bit quantization workflow that preserves accuracy across different neural network architectures.
Findings
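Quantization can reduce model size and improve inference latency and throughput by using high-throughput integer instructions.
The proposed 8-bit workflow keeps accuracy within 1% of the floating-point baseline on all networks studied, including MobileNets and BERT-large.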
Quantization techniques are effective across vision, speech, and language models.
Abstract
Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high-throughput integer instructions. In this paper we review the mathematical aspects of quantization parameters and evaluate their choices on a wide range of neural network models for different application domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.
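To make the idea of a quantization parameter concrete, here is a minimal sketch of symmetric (scale-only) 8-bit quantization in plain Python. It is a generic illustration, not necessarily the paper's exact workflow: the function names, the choice of the largest absolute value as the calibration range, and the restricted integer range [-127, 127] are all assumptions made for this example.

```python
def quantize_symmetric(values, num_bits=8):
    """Map real values to signed integers using a single scale factor.

    Assumption: the representable range is calibrated from the largest
    absolute value in `values`; other calibration methods are possible.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    amax = max(abs(v) for v in values)        # calibration range (assumed)
    scale = qmax / amax if amax > 0 else 1.0  # real value -> integer grid
    # Round to the nearest integer and clip to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate real values."""
    return [q / scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]              # hypothetical example values
q, s = quantize_symmetric(weights)
restored = dequantize(q, s)
print(q)  # [51, -127, 32, 0]
```

With this scheme the round-trip error of any in-range value is at most half a quantization step, that is `0.5 / scale`, which is why the choice of calibration range matters for accuracy.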
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
