Improving RNN-Transducers with Acoustic LookAhead

Vinit S. Unni; Ashish Mittal; Preethi Jyothi; Sunita Sarawagi

arXiv:2307.05006 · cs.CL · July 12, 2023

TL;DR

This paper introduces LookAhead, a method that makes the text representations in RNN-Transducers more acoustically grounded by looking ahead into the audio input, yielding a 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.

Contribution

We propose LookAhead, a novel technique that incorporates future audio context into text representations in RNN-T models, reducing hallucinations and improving accuracy.

Findings

01. 5%-20% relative WER reduction on evaluation sets

02. Improved robustness to out-of-domain data

03. Enhanced acoustic grounding of text representations
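
For readers unfamiliar with the metric, a relative WER reduction is the drop in word error rate expressed as a fraction of the baseline WER. The snippet below is a minimal sketch of that arithmetic. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative reduction (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent, NOT taken from the paper.
low = relative_wer_reduction(10.0, 9.5)   # a 5% relative reduction
high = relative_wer_reduction(10.0, 8.0)  # a 20% relative reduction
print(f"{low:.0%} to {high:.0%}")
```

Under this definition, a baseline WER of 10.0 would need to fall to somewhere between 9.5 and 8.0 to land in the reported 5%-20% range.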

Abstract

RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech-to-text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings through a thin joint network. While this architecture provides state-of-the-art streaming accuracy, it also makes the model vulnerable to strong language-model (LM) biasing, which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LookAhead, a technique that makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
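
The "thin joint network" described in the abstract can be illustrated with a generic RNN-T joint: each audio encoding and each text encoding is projected separately, the two projections are added, and a single nonlinearity and output layer produce a distribution over the vocabulary. The code below is a minimal sketch of that standard design, not the authors' implementation and not LookAhead itself. All names, dimensions, and random weights are assumptions made for illustration.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def joint_logits(enc, pred, w_enc, w_pred, w_out):
    """Combine audio encodings enc[t] and text encodings pred[u] into logits[t][u][v]."""
    # Each stream is projected on its own; the two encoders never see each other.
    enc_proj = [matvec(w_enc, f) for f in enc]
    pred_proj = [matvec(w_pred, g) for g in pred]
    logits = []
    for e in enc_proj:
        row = []
        for p in pred_proj:
            # The only interaction between audio and text: add, squash, project.
            hidden = [math.tanh(a + b) for a, b in zip(e, p)]
            row.append(matvec(w_out, hidden))
        logits.append(row)
    return logits


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy sizes, all hypothetical: frames, text positions, feature dims, vocabulary.
T, U, D_ENC, D_PRED, D_JOINT, VOCAB = 4, 3, 5, 6, 8, 7
enc = random_matrix(T, D_ENC, rng)     # stand-in for audio encoder outputs
pred = random_matrix(U, D_PRED, rng)   # stand-in for text encoder outputs
w_enc = random_matrix(D_JOINT, D_ENC, rng)
w_pred = random_matrix(D_JOINT, D_PRED, rng)
w_out = random_matrix(VOCAB, D_JOINT, rng)

logits = joint_logits(enc, pred, w_enc, w_pred, w_out)
probs = [[softmax(v) for v in row] for row in logits]
print(len(probs), len(probs[0]), len(probs[0][0]))  # 4 3 7
```

Because the text encoding `pred[u]` is computed without any access to the audio, it can carry strong language-model priors into the joint; this is the property the abstract identifies as the source of hallucination.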

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing