Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference
Tomer Gafni, Asaf Karnieli, Yair Hanani

TL;DR
This paper introduces a dual precision quantization scheme that combines 4-bit weights with 8-bit floating-point inference to enhance neural network efficiency while preserving accuracy.
Contribution
The paper proposes a novel hardware-efficient quantization method called Dual Precision Quantization (DPQ) that minimizes accuracy loss in low-precision neural network inference.
Findings
Delivers significant speedups and improved memory utilization compared to 16-bit operations.
Maintains accuracy close to full-precision models.
Applies to various modern accelerators.
Abstract
Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy…
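The storage half of the W4A8 scheme described above (weights kept as 4-bit integers) can be illustrated with a minimal sketch. It assumes plain symmetric round-to-nearest quantization with one scale per group of weights, which is a common baseline and not necessarily the procedure the paper proposes. The function names and the `group_size` parameter are hypothetical. The 8-bit floating-point compute step is left out because NumPy has no FP8 type.

```python
import numpy as np

def quantize_int4(w, group_size=8):
    """Symmetric round-to-nearest INT4 quantization, one scale per group (assumed scheme)."""
    groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group to 7, the largest positive INT4 value.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero groups
    q = np.clip(np.rint(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def pack_int4(q):
    """Pack pairs of signed 4-bit values into single bytes (low nibble first)."""
    nibbles = (q.reshape(-1) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    nibbles = np.stack([lo, hi], axis=1).reshape(-1)
    # Sign-extend: nibble values 8..15 represent -8..-1.
    return np.where(nibbles > 7, nibbles - 16, nibbles).astype(np.int8)

def dequantize_int4(packed, scales, group_size=8):
    """Rebuild approximate floating-point weights from packed INT4 values and scales."""
    q = unpack_int4(packed).reshape(-1, group_size)
    return q * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16))          # toy weight matrix
q, scales = quantize_int4(w)
packed = pack_int4(q)
w_hat = dequantize_int4(packed, scales).reshape(w.shape)

print("packed weight bytes:", packed.nbytes)
print("16-bit weight bytes:", w.size * 2)
print("max abs error:", np.abs(w - w_hat).max())
```

Packing two values per byte is where the memory saving over 16-bit storage comes from; the per-group scales add a small overhead that shrinks as `group_size` grows, at the cost of coarser quantization within each group.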
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Image Processing Techniques
