Sub-8-bit quantization for on-device speech recognition: a regularization-free approach

Kai Zhen; Martin Radfar; Hieu Duy Nguyen; Grant P. Strimel; Nathan Susanj; Athanasios Mouchtaris

arXiv:2210.09188·cs.SD·November 2, 2022

Open Access

TL;DR

This paper introduces General Quantizer (GQ), a regularization-free quantization method with self-adjustable centroids. GQ compresses speech recognition models to sub-8-bit precision without accuracy degradation, with a reported 30.73% memory saving and 31.75% latency reduction.

Contribution

The paper proposes GQ, a novel quantization scheme that eliminates the need for fixed centroids, improving efficiency and versatility in on-device speech recognition models.

Findings

01. GQ compresses models to sub-8-bit precision without accuracy degradation.

02. GQ achieves a 30.73% memory saving and a 31.75% latency reduction.

03. GQ is effective on both RNN-T and Conformer architectures.

Abstract

For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a mu-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory…
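The abstract names three ingredients: a mu-law constrained space, centroids that are not fixed in advance, and a "soft-to-hard" transition. It does not give the formulation, so the following is only a minimal sketch under stated assumptions: the soft assignment is taken to be a softmax over negative squared distances, the sharpness parameter `alpha` and the constant `MU = 255` are illustrative choices, and the centroids are hard-coded here, whereas in training they would be learnable parameters. All function names are hypothetical and are not taken from the paper.

```python
import math

MU = 255.0  # assumption: the classic mu-law constant; the paper's value is not stated here


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] along the mu-law companding curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def soft_assign(w, centroids, alpha):
    """Softmax over negative squared distances; larger alpha is closer to one-hot."""
    logits = [-alpha * (w - c) ** 2 for c in centroids]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(w, centroids, alpha=None):
    """Soft quantization when alpha is given, nearest-centroid (hard) otherwise."""
    if alpha is None:
        return min(centroids, key=lambda c: (w - c) ** 2)
    probs = soft_assign(w, centroids, alpha)
    return sum(p * c for p, c in zip(probs, centroids))


# 2-bit example: four centroids spaced uniformly in the companded domain,
# then expanded, so they cluster near zero where most weights lie.
centroids = [mu_law_expand(y) for y in (-0.75, -0.25, 0.25, 0.75)]

w = 0.02
for alpha in (1.0, 100.0, 1e5):  # raising alpha moves the output from soft to hard
    print(f"alpha={alpha:g}: {quantize(w, centroids, alpha):.6f}")
print(f"hard: {quantize(w, centroids):.6f}")
```

With a small `alpha` the output is a smooth blend of all centroids, which keeps gradients flowing to both the weight and the centroids; as `alpha` grows, the output converges to the nearest centroid, which is what inference would use. A k-bit layer corresponds to 2^k centroids in this picture.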

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications

Methods: Attentive Walk-Aggregating Graph Neural Network