Quantizing deep convolutional networks for efficient inference: A whitepaper
Raghuraman Krishnamoorthi

TL;DR
This paper reviews techniques for quantizing convolutional neural networks to 8-bit precision, enabling significant reductions in model size and inference latency with minimal accuracy loss, and introduces tools for practical implementation.
Contribution
It provides a comprehensive overview of post-training and quantization-aware training methods, benchmarks their performance on various hardware, and offers best practices and tools for deployment.
Findings
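Post-training 8-bit quantization (per-channel weights, per-layer activations) keeps classification accuracy within 2% of floating-point networks
Quantized models run 2x-3x faster than floating point on CPUs and up to 10x faster on specialized processors with fixed-point SIMD support
Quantization-aware training narrows the accuracy gap to floating point to 1% and enables lower precision with acceptable accuracy loss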
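As a rough illustration of the quantization-aware training idea, the sketch below simulates quantization in the forward pass and passes gradients straight through in the backward pass. This is a common way to implement such training, not necessarily the exact procedure used in the paper. The function names `fake_quantize` and `ste_gradient` and the fixed clipping range are assumptions made for this example.

```python
from typing import List


def fake_quantize(x: List[float], x_min: float, x_max: float,
                  num_bits: int = 8) -> List[float]:
    """Simulate quantization: clamp to [x_min, x_max], snap to a uniform
    grid of 2**num_bits levels, and return the result as floats."""
    levels = 2 ** num_bits - 1            # 255 steps for 8 bits
    scale = (x_max - x_min) / levels      # width of one quantization step
    out = []
    for v in x:
        clamped = min(max(v, x_min), x_max)
        q = round((clamped - x_min) / scale)   # integer grid index
        out.append(q * scale + x_min)          # back to a float value
    return out


def ste_gradient(x: List[float], upstream: List[float],
                 x_min: float, x_max: float) -> List[float]:
    """Straight-through estimator (assumed here): treat rounding as the
    identity, so the gradient passes unchanged inside the clipping range
    and is zero where the input was clamped."""
    return [g if x_min <= v <= x_max else 0.0 for v, g in zip(x, upstream)]


values = [-2.0, 0.0, 0.3, 5.0]
simulated = fake_quantize(values, x_min=-1.0, x_max=1.0)
grads = ste_gradient(values, [1.0, 1.0, 1.0, 1.0], x_min=-1.0, x_max=1.0)
```

For simplicity this grid does not force 0.0 to be exactly representable, and the clipping range is fixed rather than learned or tracked during training.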
Abstract
We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision post-training produces classification accuracies within 2% of floating point networks for a wide variety of CNN architectures. Model sizes can be reduced by a factor of 4 by quantizing weights to 8 bits, even when 8-bit arithmetic is not supported. This can be achieved with simple, post-training quantization of weights. We benchmark latencies of quantized networks on CPUs and DSPs and observe a speedup of 2x-3x for quantized implementations compared to floating point on CPUs. Speedups of up to 10x are observed on specialized processors with fixed point SIMD capabilities, like the Qualcomm QDSPs with HVX. Quantization-aware training can provide further…
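The scheme named in the abstract, per-channel quantization of weights and per-layer quantization of activations, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes symmetric quantization for weights (one scale per output channel), affine quantization for activations (one scale and zero-point for the whole tensor), and a min/max range estimate. All function names are hypothetical.

```python
from typing import List, Tuple


def quantize_weights_per_channel(
    weights: List[List[float]], num_bits: int = 8
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric quantization with one scale per output channel (row)."""
    qmax = 2 ** (num_bits - 1) - 1        # 127 for 8 bits
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(
    q_rows: List[List[int]], scales: List[float]
) -> List[List[float]]:
    """Map integer weights back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def quantize_activations_per_layer(
    x: List[float], num_bits: int = 8
) -> Tuple[List[int], float, int]:
    """Affine quantization with a single scale and zero-point per tensor.
    The range is widened to include 0.0 so zero maps to an exact integer."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min = min(0.0, min(x))
    x_max = max(0.0, max(x))
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = max(qmin, min(qmax, round(qmin - x_min / scale)))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point


# Two output channels with very different ranges: a single per-layer scale
# would crush the second row, while per-channel scales preserve it.
w = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.005]]
w_q, w_scales = quantize_weights_per_channel(w)
w_restored = dequantize_per_channel(w_q, w_scales)

a_q, a_scale, a_zero = quantize_activations_per_layer([-1.0, 0.0, 3.0])
```

Storing each weight as one 8-bit integer instead of a 32-bit float is where the factor-of-4 size reduction comes from; the per-channel scales add only one float per output channel.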
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
