SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity

Cyrus Zhou; Pedro Savarese; Zack Hassman; Vaughn Richard; Michael DiBrino; Michael Maire; Yanjing Li

arXiv:2311.14114 · cs.AR · November 11, 2025 · 1 citation

TL;DR

SONIQ introduces a system-optimized, noise-injected quantization method that learns mixed precision per channel, achieving ultra-low-precision inference with accuracy comparable to full-precision models and significant speedups on standard hardware.

Contribution

The paper presents SONIQ, a framework that learns per-channel mixed-precision quantization with hardware-calibrated noise injection, enabling ultra-low-precision inference without specialized hardware.

Findings

01 Achieves up to 16x compression for CNNs and 7x for Transformers.

02 Matches or exceeds full-precision accuracy at ultra-low bit regimes.

03 Delivers up to 7.3x CPU speedup over INT8 baselines and up to 6.3x GPU speedup relative to FP16.
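
To make the compression figures concrete, here is a back-of-the-envelope sketch of how a per-channel bit allocation translates into a compression ratio. It assumes a 32-bit baseline and ignores per-channel scale factors and other metadata. This summary does not state either choice, so the paper's own accounting may differ. The function name `compression_ratio` and the channel sizes are hypothetical.

```python
def compression_ratio(bits_per_channel, channel_sizes, baseline_bits=32):
    """Ratio of baseline storage to quantized storage, in bits.

    Assumption: every baseline value costs `baseline_bits`; quantization
    metadata (scales, zero points) is not counted.
    """
    baseline = baseline_bits * sum(channel_sizes)
    quantized = sum(b * n for b, n in zip(bits_per_channel, channel_sizes))
    return baseline / quantized

# Two equally sized channels, both stored at 2 bits.
print(compression_ratio([2, 2], [100, 100]))  # 16.0

# Same channels, but one is kept at a higher (illustrative) 6-bit width.
print(compression_ratio([2, 6], [100, 100]))  # 8.0
```

Under these assumptions, a 16x ratio corresponds to an average of 2 bits per value, which shows how strongly the share of higher-precision channels affects the total.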

Abstract

Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment -- without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16x and 7x compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3x CPU speedup over strong INT8 baselines and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels -- one in the 1--4-bit…
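
The idea of training with injected quantization noise can be illustrated with a minimal sketch. It is not SONIQ's actual procedure: it swaps in plain uniform noise of one quantization step per channel in place of the hardware-calibrated noise the abstract describes, and the helper names (`step_size`, `quantize`, `inject_noise`) and bit widths are hypothetical. It only shows why a noisy forward pass can stand in for hard rounding, since both perturb each value by at most half a step.

```python
import random

def step_size(channel, bits):
    """Grid spacing of a symmetric uniform quantizer spanning [-m, +m], m = max |w|."""
    m = max(abs(w) for w in channel)
    return 2.0 * m / (2 ** bits - 1) if m > 0 else 1.0

def quantize(channel, bits):
    """Inference-style rule: snap each value to the nearest of 2**bits grid points."""
    m = max(abs(w) for w in channel)
    delta = step_size(channel, bits)
    top = 2 ** bits - 1
    out = []
    for w in channel:
        k = min(max(round((w + m) / delta), 0), top)  # grid index, clamped to range
        out.append(-m + k * delta)
    return out

def inject_noise(channel, bits, rng):
    """Training-style surrogate (assumed): add uniform noise one step wide instead of rounding."""
    delta = step_size(channel, bits)
    return [w + rng.uniform(-delta / 2, delta / 2) for w in channel]

rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2)]
bits_per_channel = [1, 4]  # illustrative per-channel widths, not taken from the paper

for channel, bits in zip(weights, bits_per_channel):
    hard = quantize(channel, bits)
    soft = inject_noise(channel, bits, rng)
    worst_hard = max(abs(a - b) for a, b in zip(channel, hard))
    worst_soft = max(abs(a - b) for a, b in zip(channel, soft))
    print(f"{bits}-bit channel: step={step_size(channel, bits):.3f}, "
          f"max rounding error={worst_hard:.3f}, max noise={worst_soft:.3f}")
```

The noisy version stays differentiable with respect to the weights, which is the usual motivation for this kind of surrogate. In a learned mixed-precision scheme the bit width per channel would itself be a trainable quantity rather than a fixed list.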

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Applications · Machine Learning and ELM

Methods: ALIGN