Contextual RNN-T For Open Domain ASR

Mahaveer Jain; Gil Keren; Jay Mahadeokar; Geoffrey Zweig; Florian Metze; Yatharth Saraf

arXiv:2006.03411 · eess.AS · August 14, 2020

TL;DR

This paper introduces modifications to the RNN-T model that incorporate contextual metadata to improve recognition of rare words, especially named entities, in open domain speech recognition tasks.

Contribution

The paper proposes a novel approach to enhance RNN-T models by integrating contextual metadata, addressing the challenge of recognizing rare words in open domain ASR.

Findings

01  16% relative improvement in WER-NE on videos with metadata

02  Effective use of attention and biasing models for context incorporation

03  Improved recognition of named entities in open domain ASR
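
The 16% figure is a relative improvement, not an absolute one, so it is measured against the baseline error rate. A minimal sketch of that arithmetic follows, assuming WER-NE denotes word error rate on named entities. The function name and the baseline and contextual values are hypothetical, chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER-NE values (in percent), NOT taken from the paper.
baseline_wer_ne = 25.0    # assumed baseline RNN-T
contextual_wer_ne = 21.0  # assumed contextual RNN-T

gain = relative_improvement(baseline_wer_ne, contextual_wer_ne)
print(f"{gain:.0%}")  # prints "16%"
```

With these illustrative values the absolute drop is 4 points of WER-NE, while the relative improvement is 16%.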

Abstract

End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual components of a traditional hybrid ASR system (acoustic model, language model, pronunciation model) into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention…
