Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization
Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan

TL;DR
This paper applies aggressive 4-bit quantization with quantization-aware training to RNN-T models, accelerating inference and compressing the model while keeping accuracy close to the baseline, including with large beam widths and language model fusion.
Contribution
It presents 4-bit quantization strategies tailored to RNN-T models that achieve near-iso-accuracy and substantial speedup, making large-beam-width inference practical.
Findings
3.4× acceleration from FP16 to INT4 in hardware simulations
7.6× model compression ratio with quantization
>1.5% WER improvement on test sets
Abstract
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization-Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density ratio Language Model fusion has shown remarkable accuracy gains on RNN-T workloads, but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full model compression ratio of 7.6 compared to the…
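To make the 4-bit integer representation mentioned in the abstract concrete, here is a minimal sketch of symmetric "fake" quantization, the kind of operation QAT typically inserts into the forward pass so that the model learns to tolerate rounding error. It is an illustration only and not the paper's customized scheme: the function name `fake_quantize_int4`, the per-tensor max-based scale, and the sample weights are all assumptions made for this example.

```python
def fake_quantize_int4(values, scale=None):
    """Quantize floats to signed 4-bit codes in [-8, 7], then dequantize.

    Hypothetical helper for illustration: uses one scale for the whole
    list, derived from the largest magnitude unless a scale is supplied.
    """
    qmin, qmax = -8, 7  # range of a signed 4-bit integer
    if scale is None:
        max_abs = max((abs(v) for v in values), default=0.0)
        # Map the largest magnitude onto qmax; fall back to 1.0 for all-zero input.
        scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clip to the 4-bit range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    # Dequantized values are what the rest of the network would see during QAT.
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale


# Sample values chosen for illustration, not taken from the paper.
weights = [0.70, -0.33, 0.10, 0.0, -0.70]
codes, dequantized, scale = fake_quantize_int4(weights)
print(codes)
print(dequantized)
```

Each value is stored as one of only 16 integer codes plus a shared scale, which is where the memory and compute savings come from. The gap between `weights` and `dequantized` is the quantization error that retraining has to absorb.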
Taxonomy
Topics: Geophysical Methods and Applications · Speech Recognition and Synthesis · Advanced Neural Network Applications
Methods: Test · Attentive Walk-Aggregating Graph Neural Network
