SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity
Cyrus Zhou, Pedro Savarese, Zack Hassman, Vaughn Richard, Michael DiBrino, Michael Maire, Yanjing Li

TL;DR
SONIQ introduces a system-optimized, noise-injected quantization method that learns mixed precision per channel, achieving ultra-low-precision inference with accuracy comparable to full-precision models and significant speedups on standard hardware.
Contribution
The paper presents SONIQ, a framework that learns per-channel mixed-precision quantization with hardware-calibrated noise injection, enabling ultra-low-precision inference without specialized hardware.
Findings
Achieves up to 16x compression for CNNs and up to 7x for Transformers.
Matches or exceeds full-precision accuracy in ultra-low-bit regimes.
Delivers up to 7.3x CPU speedup over strong INT8 baselines, and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16.
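To make the compression figures concrete, the sketch below shows how a weight-only compression ratio follows from a per-channel bit allocation. It is illustrative arithmetic, not the paper's accounting: the 32-bit baseline, the function name `compression_ratio`, and the example layer (64 channels, 2-bit and 4-bit levels) are all assumptions, and storage overhead such as per-channel scales is ignored.

```python
import numpy as np

def compression_ratio(bits_per_channel, weights_per_channel, baseline_bits=32):
    """Baseline storage divided by mixed-precision storage (weights only).

    baseline_bits=32 is an assumption; the source does not state the baseline.
    """
    bits = np.asarray(bits_per_channel, dtype=np.int64)
    counts = np.asarray(weights_per_channel, dtype=np.int64)
    baseline_total = baseline_bits * counts.sum()   # bits at full precision
    quantized_total = (bits * counts).sum()         # bits after quantization
    return baseline_total / quantized_total

# Hypothetical layer: 64 channels of 576 weights each,
# 48 channels stored at 2 bits and 16 channels stored at 4 bits.
bits = [2] * 48 + [4] * 16
counts = [576] * 64
ratio = compression_ratio(bits, counts)
print(f"average bits per weight: {32 / ratio:.2f}, compression: {ratio:.1f}x")
```

Under these assumptions the average width is 2.5 bits per weight, so the ratio depends only on how many channels land on the lower precision level.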
Abstract
Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment -- without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16x and 7x compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3x CPU speedup over strong INT8 baselines and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels -- one in the 1--4-bit…
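The idea of training with injected quantization noise can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not SONIQ's implementation: it uses symmetric uniform quantization and additive uniform noise of one step width as a generic stand-in for the paper's hardware-calibrated noise, and the names `channel_steps`, `quantize`, and `inject_noise` are hypothetical.

```python
import numpy as np

def channel_levels(bits):
    """Largest positive integer code for each channel (requires bits >= 2)."""
    return 2.0 ** (np.asarray(bits, dtype=np.float64) - 1) - 1

def channel_steps(w, bits):
    """Per-channel step size; w has shape (channels, n)."""
    max_abs = np.maximum(np.abs(w).max(axis=1), 1e-12)  # avoid a zero step
    return max_abs / channel_levels(bits)

def quantize(w, bits):
    """Inference-style path: round each channel onto its own grid."""
    step = channel_steps(w, bits)[:, None]
    levels = channel_levels(bits)[:, None]
    codes = np.clip(np.round(w / step), -levels, levels)
    return codes * step

def inject_noise(w, bits, rng):
    """Training-style proxy: perturb weights by uniform noise in [-step/2, step/2).

    Assumption: uniform noise approximates rounding error; the paper's
    hardware-calibrated noise model is not reproduced here.
    """
    step = channel_steps(w, bits)[:, None]
    return w + rng.uniform(-0.5, 0.5, size=w.shape) * step

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256))          # 4 channels, 256 weights each
bits = np.array([2, 2, 4, 8])          # hypothetical per-channel precisions
w_q = quantize(w, bits)
w_n = inject_noise(w, bits, rng)
```

In this sketch the noisy weights and the rounded weights deviate from the originals by at most half a step per channel, so the perturbation seen in training has the same magnitude as the rounding error seen at inference.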
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · Machine Learning and ELM
Methods: ALIGN
