Contextual RNN-T For Open Domain ASR
Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

TL;DR
This paper introduces modifications to the RNN-T model that incorporate contextual metadata to improve recognition of rare words, especially named entities, in open domain speech recognition tasks.
Contribution
The paper proposes a novel approach to enhance RNN-T models by integrating contextual metadata, addressing the challenge of recognizing rare words in open domain ASR.
Findings
16% relative improvement in WER-NE (word error rate on named entities) on videos with metadata
Effective use of attention and biasing models for context incorporation
Improved recognition of named entities in open domain ASR
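To make the "relative improvement" figure concrete, here is a minimal sketch of how a relative error-rate reduction is usually computed. The function name and the baseline and contextual WER-NE values are hypothetical, chosen only so the arithmetic comes out to 16%; they are not numbers reported in the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the error-rate reduction as a fraction of the baseline rate."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER-NE values (percent), for illustration only.
baseline_wer_ne = 25.0    # assumed baseline RNN-T
contextual_wer_ne = 21.0  # assumed contextual RNN-T

print(f"{relative_improvement(baseline_wer_ne, contextual_wer_ne):.0%}")
```

Note that a relative improvement is different from an absolute one: in this made-up example the absolute drop is 4 points, while the relative drop is 4 / 25.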
Abstract
End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual components of a traditional hybrid ASR system (acoustic model, language model, pronunciation model) into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention…
