RNN-T For Latency Controlled ASR With Improved Beam Search
Mahaveer Jain, Kjell Schubert, Jay Mahadeokar, Ching-Feng Yeh, Kaustubh Kalgaonkar, Anuroop Sriram, Christian Fuegen, Michael L. Seltzer

TL;DR
This paper explores latency-controlled RNN Transducers for speech recognition, improving beam search decoding speed and demonstrating comparable accuracy with enhanced efficiency on English video datasets.
Contribution
It introduces a latency-tuning mechanism for RNN-T and enhances the beam search algorithm for faster decoding, advancing real-time ASR applications.
Findings
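Achieved WER comparable to a well-tuned hybrid ASR baseline
Improved the decoding speed of the originally proposed RNN-T beam search algorithm
Demonstrated better computational efficiency than the hybrid baseline on an English video ASR dataset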
Abstract
Neural transducer-based systems such as RNN Transducers (RNN-T) for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system (acoustic model, language model, punctuation model, inverse text normalization) into a single model. This greatly simplifies training and inference and hence makes RNN-T a desirable choice for ASR systems. In this work, we investigate the use of RNN-T in applications that require a tunable latency budget at inference time. We also improve the decoding speed of the originally proposed RNN-T beam search algorithm. We evaluate our proposed system on an English video ASR dataset and show that neural RNN-T models can achieve comparable WER and better computational efficiency compared to a well-tuned hybrid ASR baseline.
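The abstract compares systems by word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the paper's scoring pipeline, which may apply text normalization not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokens are split on whitespace, with no text
    normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.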
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
