Quantized Neural Network Inference with Precision Batching
Maximilian Lam, Zachary Yedidia, Colby Banbury, Vijay Janapa Reddi

TL;DR
PrecisionBatching is a quantized inference algorithm that accelerates neural network execution on standard hardware by decomposing a network into bitlayers. It enables low-bitwidth inference without retraining, with reported speedups of over 8x on GPU.
Contribution
The paper introduces PrecisionBatching, a method for low-bitwidth neural network inference that does not require retraining and allows flexible tradeoffs between accuracy and speed.
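One way to make the accuracy/speed tradeoff concrete is the sketch below. It assumes plain unsigned uniform quantization of an $m \times n$ weight matrix $W$ to $k$ bits with step size $s$ and offset $w_{\min}$. These symbols and the quantization scheme are illustrative assumptions, not necessarily the paper's exact formulation.

```latex
\begin{align}
W &\approx w_{\min}\,\mathbf{1}\mathbf{1}^{\top} + s \sum_{b=0}^{k-1} 2^{b} B_b,
  \qquad B_b \in \{0,1\}^{m \times n}, \\
W x &\approx w_{\min}\,(\mathbf{1}^{\top} x)\,\mathbf{1}
  + s \sum_{b=k-j}^{k-1} 2^{b}\,(B_b x),
  \qquad 1 \le j \le k .
\end{align}
```

Here $j$ is the number of most-significant bitlayers accumulated at runtime. Under these assumptions each weight is off by at most $s\,(2^{k-j} - \tfrac{1}{2})$, so every additional bitlayer roughly halves the worst-case error at the cost of one more 1-bit product $B_b x$.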
Findings
Over 8x speedup on GPU within a <1% error margin of the full-precision baseline
Runs 1.5-2x faster than traditional 8-bit quantized inference
Effective across various architectures and applications
Abstract
We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without the need for retraining or recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural…
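The bitlayer idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `to_bitlayers` and `bitlayer_matvec`, the unsigned uniform quantizer, and the use of ordinary floating-point matrix products in place of packed 1-bit kernels are all assumptions made for illustration. It only shows how a weight matrix can be split into binary layers and how the number of accumulated layers becomes a runtime knob.

```python
import numpy as np

def to_bitlayers(w, bits=8):
    """Uniformly quantize w to `bits` bits and split it into binary bit planes.

    Returns (layers, scale, lo), with layers ordered most-significant first.
    Hypothetical helper for illustration only.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.int64)  # integers in [0, 2^bits - 1]
    layers = [((q >> b) & 1).astype(np.float64) for b in range(bits - 1, -1, -1)]
    return layers, scale, lo

def bitlayer_matvec(layers, scale, lo, x, n_layers=None):
    """Approximate W @ x by accumulating the first `n_layers` bit planes.

    Activations x stay in full precision; each bit plane contributes one
    binary-matrix product weighted by its power of two.
    """
    bits = len(layers)
    n = bits if n_layers is None else n_layers
    acc = np.zeros(layers[0].shape[0])
    for i in range(n):
        acc += (2 ** (bits - 1 - i)) * (layers[i] @ x)
    # Undo the affine quantization: W ~ lo + scale * q
    return scale * acc + lo * x.sum()

# Example: compare approximations that use different numbers of bit planes.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
x = rng.normal(size=6)
layers, scale, lo = to_bitlayers(W, bits=8)
for j in (2, 4, 8):
    err = np.abs(bitlayer_matvec(layers, scale, lo, x, n_layers=j) - W @ x).max()
    print(f"{j} bit layers: max abs error = {err:.4f}")
```

In a real implementation the speedup would come from packing each bit plane into machine words and using hardware 1-bit operations. The float products above only stand in for that step so the arithmetic is easy to check.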
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning
