A Survey of Quantization Methods for Efficient Neural Network Inference
Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

TL;DR
This survey reviews various quantization techniques for neural network inference, highlighting their benefits and limitations, to aid future research in reducing model size and computational cost.
Contribution
It provides a comprehensive overview and organization of current quantization methods for neural networks, facilitating evaluation and advancement in the field.
Findings
Quantization can significantly reduce memory and latency in neural networks.
Different methods have trade-offs between accuracy and efficiency.
The survey organizes current research to guide future developments.
Abstract
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to…
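As a concrete instance of the mapping problem the abstract poses, the following is a minimal sketch of uniform (affine) quantization in plain Python. It is illustrative only and is not taken from the survey: the function names `quantize` and `dequantize`, the 8-bit default, and the min/max choice of clipping range are all assumptions made for this example.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1].

    Hypothetical helper: uses the min/max of the data as the clipping
    range, which is one simple choice among many.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include zero so that 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Scale is the real-valued width of one integer step.
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    # Zero point is the integer that represents the real value 0.0.
    zero_point = round(-lo / scale)
    # Round to the nearest step, shift, and clip to the integer range.
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate real values from the integer codes."""
    return [scale * (x - zero_point) for x in q]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, zero_point = quantize(weights, num_bits=8)
    approx = dequantize(codes, scale, zero_point)
    for w, c, a in zip(weights, codes, approx):
        print(f"{w:+.4f} -> {c:3d} -> {a:+.4f}")
```

Fewer bits widen each step (a larger `scale`), so the reconstruction error grows as storage shrinks. This is the trade-off between bit count and accuracy that the abstract describes.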
Taxonomy
Topics: Neural Networks and Applications · Medical Image Segmentation Techniques · Stochastic Gradient Optimization Techniques
