Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization

Andrea Fasoli; Chia-Yu Chen; Mauricio Serrano; Swagath Venkataramani; George Saon; Xiaodong Cui; Brian Kingsbury; Kailash Gopalakrishnan

arXiv:2206.07882·cs.CL·June 17, 2022·1 cites

Open Access

TL;DR

This paper applies aggressive 4-bit quantization with quantization-aware training to RNN-T models, accelerating inference and compressing the model while maintaining accuracy, including with large beam widths and language model fusion.

Contribution

The paper presents novel 4-bit quantization strategies tailored to RNN-T models that achieve near-iso-accuracy and substantial speedup, and that enable large-beam-width inference with minimal accuracy loss.

Findings

01. 3.4× acceleration from FP16 to INT4 in hardware simulations

02. 7.6× model compression ratio with quantization

03. >1.5% WER improvement on test sets

Abstract

We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density ratio Language Model fusion has shown remarkable accuracy gains on RNN-T workloads, but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full model compression ratio of 7.6× compared to the…
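As a rough illustration of what a 4-bit integer representation means in practice, the sketch below applies generic symmetric, per-tensor uniform quantization to a list of values and then maps the integer codes back to floats ("fake quantization", the forward-pass operation commonly used in QAT). This is not the authors' customized scheme: the function name `fake_quantize_int4`, the `clip` parameter, and the choice of a signed range of -8 to 7 are assumptions made for this example only.

```python
def fake_quantize_int4(values, clip=None):
    """Quantize floats to signed 4-bit codes, then dequantize.

    Illustrative assumption: symmetric per-tensor scaling with a signed
    range of [-8, 7]. The paper's tailored schemes may differ.
    """
    qmin, qmax = -8, 7
    if clip is None:
        # Default clipping threshold: the largest magnitude in the tensor.
        clip = max((abs(v) for v in values), default=0.0)
    if clip == 0:
        # All-zero (or empty) input: nothing to scale.
        return [0.0] * len(values), [0] * len(values)

    scale = clip / qmax  # step size between adjacent integer codes
    # Round to the nearest code and clamp into the 4-bit range.
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    # Dequantize so downstream layers see floats carrying quantization error.
    dequantized = [c * scale for c in codes]
    return dequantized, codes


if __name__ == "__main__":
    weights = [7.0, -3.0, 1.2, 0.0]
    approx, codes = fake_quantize_int4(weights)
    print(codes)
    print(approx)
```

During QAT, the rounding step is usually treated as an identity function in the backward pass (a straight-through estimator) so that gradients can still reach the full-precision weights. The clipping threshold controls the trade-off between range and resolution, which is one reason per-layer tuning of the scheme can matter.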

Taxonomy

Topics: Geophysical Methods and Applications · Speech Recognition and Synthesis · Advanced Neural Network Applications

Methods: Test · Attentive Walk-Aggregating Graph Neural Network