Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

Tomer Gafni; Asaf Karnieli; Yair Hanani

arXiv:2505.14638 · cs.CV · May 21, 2025

TL;DR

This paper introduces a dual precision (W4A8) quantization scheme that stores weights as 4-bit integers and performs inference computations in 8-bit floating point, improving speed and memory use while keeping accuracy degradation minimal.

Contribution

The paper proposes a novel hardware-efficient quantization method called Dual Precision Quantization (DPQ) that minimizes accuracy loss in low-precision neural network inference.

Findings

01 Delivers significant speedups and improved memory utilization compared to 16-bit operations.

02 Maintains accuracy close to that of full-precision models.

03 Applies across various modern accelerators.

Abstract

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy…
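
The storage and compute split described in the abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the paper's method: the symmetric per-group scaling, the group size, the two-values-per-byte packing layout, the choice of the E4M3 flavor of 8-bit floating point, and all function names are hypothetical. The sketch quantizes weights to 4-bit integers, packs them, unpacks them, and rounds them onto a simulated FP8 grid before rescaling.

```python
import numpy as np

def quantize_int4(w, group_size=4):
    """Symmetric per-group quantization to integers in [-8, 7] (assumed scheme)."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero for all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_int4(q):
    """Pack two signed 4-bit values into each byte (low nibble first)."""
    u = (q.reshape(-1) & 0x0F).astype(np.uint8)
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    q = np.stack([lo, hi], axis=1).reshape(-1)
    return np.where(q > 7, q - 16, q).astype(np.int8)  # restore the sign

def round_to_fp8_e4m3(x):
    """Simulate rounding to FP8 E4M3 (3 mantissa bits, max 448) in float64."""
    x = np.clip(np.asarray(x, dtype=np.float64), -448.0, 448.0)
    mag = np.maximum(np.abs(x), 2.0 ** -6)   # below 2^-6 the grid is subnormal
    exponent = np.floor(np.log2(mag))
    step = 2.0 ** (exponent - 3)             # spacing of representable values
    return np.round(x / step) * step

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))                  # toy weight matrix

q, scale = quantize_int4(w)                  # 4-bit integer codes plus per-group scales
packed = pack_int4(q)                        # half a byte per weight
restored = unpack_int4(packed).reshape(q.shape)

w_fp8 = round_to_fp8_e4m3(restored)          # integers in [-8, 7] are exact in E4M3
w_hat = w_fp8 * scale                        # rescale at higher precision (assumption)
```

In this sketch the 4-bit codes survive the cast to E4M3 without rounding error, because every integer in [-8, 7] is exactly representable with three mantissa bits. Under these assumptions the only weight error comes from the 4-bit rounding step itself.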

Code & Models

Repositories

intel/neural-compressor (tf, Official)

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Image Processing Techniques