Improving RNN-Transducers with Acoustic LookAhead
Vinit S. Unni, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi

TL;DR
This paper introduces LookAhead, a method to enhance RNN-Transducers by making text representations more acoustically grounded, resulting in significant improvements in speech-to-text accuracy across various datasets.
Contribution
We propose LookAhead, a novel technique that incorporates future audio context into text representations in RNN-T models, reducing hallucinations and improving accuracy.
Findings
5%-20% relative word error rate (WER) reduction on both in-domain and out-of-domain evaluation sets
Improved robustness to out-of-domain data
Enhanced acoustic grounding of text representations
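As a reading aid for the first finding, the sketch below shows how a relative WER reduction is computed. The function name and the baseline and improved WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper:
# a baseline at 20.0% WER and an improved model at 17.0% WER.
baseline, improved = 20.0, 17.0
reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.1%}")  # prints 15.0%
```

Under these assumed values, an absolute drop of 3 WER points corresponds to a 15% relative reduction, which would fall inside the reported 5%-20% range.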
Abstract
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech-to-text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings with a thin joint network. While this architecture provides state-of-the-art streaming accuracy, it also makes the model vulnerable to strong language model (LM) biasing, which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LookAhead, which makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
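To make the baseline architecture in the abstract concrete, here is a minimal sketch of a thin joint network that combines independently computed audio and text encodings. It assumes a commonly used additive form (project both encodings, add, apply tanh, then an output projection); the abstract does not specify the exact joint form, and this sketch does not implement LookAhead itself. All names, dimensions, and the random weights are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_network(audio_enc, text_enc, w_audio, w_text, w_out):
    """Combine every audio frame encoding with every text-context encoding.

    Returns a T x U lattice where lattice[t][u] is a distribution over the
    vocabulary (including blank) for audio frame t and text prefix u.
    Assumed form: softmax(w_out @ tanh(w_audio @ f_t + w_text @ g_u)).
    """
    audio_proj = [matvec(w_audio, f) for f in audio_enc]
    text_proj = [matvec(w_text, g) for g in text_enc]
    lattice = []
    for a in audio_proj:
        row = []
        for g in text_proj:
            hidden = [math.tanh(x + y) for x, y in zip(a, g)]
            row.append(softmax(matvec(w_out, hidden)))
        lattice.append(row)
    return lattice


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4 audio frames, 3 text prefixes, vocabulary of 10.
rng = random.Random(0)
T, U, D_AUDIO, D_TEXT, D_JOINT, VOCAB = 4, 3, 5, 6, 8, 10
audio_enc = random_matrix(T, D_AUDIO, rng)   # stand-in for audio encoder output
text_enc = random_matrix(U, D_TEXT, rng)     # stand-in for text encoder output
w_audio = random_matrix(D_JOINT, D_AUDIO, rng)
w_text = random_matrix(D_JOINT, D_TEXT, rng)
w_out = random_matrix(VOCAB, D_JOINT, rng)

lattice = joint_network(audio_enc, text_enc, w_audio, w_text, w_out)
print(len(lattice), len(lattice[0]), len(lattice[0][0]))  # prints 4 3 10
```

In this sketch the text encoding for each prefix is computed without any access to the audio, and the two streams meet only in the joint network. That separation is the property the abstract identifies as the source of LM biasing, and it is what LookAhead aims to address by grounding the text representation in upcoming audio.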
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
