Quantized Neural Network Inference with Precision Batching

Maximilian Lam; Zachary Yedidia; Colby Banbury; Vijay Janapa Reddi

arXiv:2003.00822 · cs.LG · March 3, 2020 · 1 cites

TL;DR

PrecisionBatching is a quantized inference algorithm that accelerates neural network execution on standard hardware by decomposing networks into bitlayers. It enables low-bitwidth inference without retraining, with reported GPU speedups of over 8x.

Contribution

The paper introduces PrecisionBatching, a method for low-bitwidth neural network inference that does not require retraining and allows flexible tradeoffs between accuracy and speed.

Findings

01 Over 8x speedup on GPU with <1% error margin

02 Outperforms traditional 8-bit quantization by 1.5-2x

03 Effective across various architectures and applications

Abstract

We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without the need for retraining or recalibration, but also 1) lets traditional hardware platforms realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural…
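The bitlayer idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's kernel: the min/max uniform quantizer, the function names `decompose_bitlayers` and `bitlayer_matvec`, and the `num_layers` parameter are all assumptions made for illustration, and the bit planes are stored as dense 0/1 arrays rather than packed bits, so the sketch shows the arithmetic only and none of the speedup. A weight matrix is quantized to unsigned integers, split into one 0/1 matrix per bit (most significant first), and a matrix-vector product is rebuilt by accumulating a chosen number of those bitlayers against a full-precision activation vector.

```python
import numpy as np


def decompose_bitlayers(weights, num_bits=8):
    """Quantize weights to unsigned num_bits integers and split them into
    0/1 bit planes, most significant bit first (hypothetical helper)."""
    w_min = float(weights.min())
    w_max = float(weights.max())
    scale = (w_max - w_min) / (2 ** num_bits - 1)
    if scale == 0.0:
        scale = 1.0  # constant matrix: every quantized value is 0
    # weights ~= w_min + scale * q, with q an integer in [0, 2**num_bits - 1]
    q = np.round((weights - w_min) / scale).astype(np.int64)
    bitlayers = [((q >> i) & 1).astype(np.float64)
                 for i in range(num_bits - 1, -1, -1)]
    return bitlayers, scale, w_min


def bitlayer_matvec(bitlayers, scale, w_min, x, num_layers=None):
    """Approximate weights @ x by accumulating the first num_layers
    bit planes; x stays in full precision (hypothetical helper)."""
    num_bits = len(bitlayers)
    k = num_bits if num_layers is None else min(num_layers, num_bits)
    acc = np.zeros(bitlayers[0].shape[0])
    for j in range(k):
        bit_weight = 2 ** (num_bits - 1 - j)  # place value of bit plane j
        acc += bit_weight * (bitlayers[j] @ x)
    # Undo the affine quantization: W @ x ~= scale * (q @ x) + w_min * sum(x)
    return scale * acc + w_min * x.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 32))
    x = rng.normal(size=32)
    layers, scale, w_min = decompose_bitlayers(W, num_bits=8)
    exact = W @ x
    for k in (2, 4, 8):
        approx = bitlayer_matvec(layers, scale, w_min, x, num_layers=k)
        print(k, float(np.max(np.abs(approx - exact))))
```

Passing a smaller `num_layers` drops the least significant bit planes, which loosens the worst-case error bound in exchange for fewer accumulations. That is the kind of runtime accuracy and speed knob the abstract describes, here under the assumed quantizer.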

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning