Sub-8-bit quantization for on-device speech recognition: a regularization-free approach
Kai Zhen, Martin Radfar, Hieu Duy Nguyen, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

TL;DR
This paper introduces General Quantizer (GQ), a regularization-free quantization method with self-adjustable centroids. It compresses speech recognition models to sub-8-bit precision without accuracy loss, significantly reducing memory footprint and latency.
Contribution
The paper proposes GQ, a novel quantization scheme that eliminates the need for fixed centroids, improving efficiency and versatility in on-device speech recognition models.
Findings
GQ compresses models to sub-8-bit without accuracy loss.
Achieves 30.73% memory savings and 31.75% latency reduction.
Effective on RNN-T and Conformer architectures.
Abstract
For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a mu-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory…
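The two ingredients named in the abstract, a mu-Law constrained space and a "soft-to-hard" assignment to centroids, can be illustrated with a small NumPy sketch. The abstract gives no formulas, so everything here is an assumption for illustration rather than the authors' implementation: the function names, the value mu = 255, the squared-distance softmax, and the sharpness parameter `alpha` are all hypothetical. In actual training the centroids would be learnable parameters; here they are fixed so the example stays self-contained.

```python
import numpy as np

def mu_law_compress(w, mu=255.0):
    """Map values in [-1, 1] into the mu-law companded domain (also [-1, 1])."""
    return np.sign(w) * np.log1p(mu * np.abs(w)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def soft_to_hard_quantize(w, centroids, alpha):
    """Return (soft, hard) quantized versions of a 1-D weight vector.

    soft: softmax-weighted average of centroids (differentiable in a real
          training framework); alpha controls how sharp the softmax is.
    hard: nearest-centroid assignment, which soft approaches as alpha grows.
    """
    w = np.asarray(w, dtype=float).ravel()
    sq_dist = (w[:, None] - centroids[None, :]) ** 2
    logits = -alpha * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    soft = probs @ centroids
    hard = centroids[np.argmin(sq_dist, axis=1)]
    return soft, hard

# Hypothetical example: four weights, four centroids (2 bits per weight).
weights = np.array([-0.8, -0.05, 0.02, 0.6])
centroids = np.linspace(-1.0, 1.0, 4)

companded = mu_law_compress(weights)
soft_lo, hard = soft_to_hard_quantize(companded, centroids, alpha=1.0)
soft_hi, _ = soft_to_hard_quantize(companded, centroids, alpha=1e4)

# Map the hard-quantized values back to the original weight domain.
dequantized = mu_law_expand(hard)
```

With a small `alpha` the soft output blends several centroids; with a large `alpha` it coincides with the nearest-centroid (hard) output, which is the sense of "soft-to-hard" assumed in this sketch.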
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications
Methods: Attentive Walk-Aggregating Graph Neural Network
