Improving RNN-T ASR Accuracy Using Context Audio

Andreas Schwarz; Ilya Sklyar; Simon Wiesler

arXiv:2011.10538 · eess.AS · June 16, 2021

TL;DR

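This paper introduces a training scheme for streaming RNN-T ASR that lets the encoder exploit context audio from the stream. The authors report word error rate reductions of more than 6% in a realistic production setting for a voice assistant, along with indications of better speaker and environment adaptation on audio containing background speech.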

Contribution

The paper proposes a training approach that uses segmented or partially labeled sequences of an audio stream so that the RNN-T encoder learns to exploit context audio, which is then used during both training and inference in a streaming speech recognition system.

Findings

1. Over 6% WER reduction in a realistic production setting
2. Improved adaptation to background speech
3. Visualization of gradient flow shows long-term context learning

Abstract

We present a training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) which allows the encoder network to learn to exploit context audio from a stream, using segmented or partially labeled sequences of the stream during training. We show that the use of context audio during training and inference can lead to word error rate reductions of more than 6% in a realistic production setting for a voice assistant ASR system. We investigate the effect of the proposed training approach on acoustically challenging data containing background speech and present data points which indicate that this approach helps the network learn both speaker and environment adaptation. To gain further insight into the ability of a long short-term memory (LSTM) based ASR encoder to exploit long-term context, we also visualize RNN-T loss gradients with…
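The idea of letting a stateful encoder see audio that precedes the labeled segment can be illustrated with a toy sketch. This is not the authors' implementation: the leaky running average below is only a stand-in for an LSTM encoder, and the names `toy_recurrent_encoder`, `encode_with_context`, and `context_frames` are hypothetical. The sketch assumes that encoder outputs for the context frames are discarded, so that only the labeled segment would contribute to a loss.

```python
from typing import List, Tuple


def toy_recurrent_encoder(
    frames: List[float], state: float = 0.0, decay: float = 0.5
) -> Tuple[List[float], float]:
    """Stand-in for a stateful encoder (assumption: a leaky running average,
    not an LSTM). Returns per-frame outputs and the final state."""
    outputs = []
    for x in frames:
        state = decay * state + (1.0 - decay) * x
        outputs.append(state)
    return outputs, state


def encode_with_context(
    stream: List[float], seg_start: int, seg_end: int, context_frames: int
) -> List[float]:
    """Encode the labeled segment stream[seg_start:seg_end].

    Up to `context_frames` frames before the segment are fed through the
    encoder first. Their outputs are dropped; only the resulting state is
    carried into the labeled segment.
    """
    ctx_start = max(0, seg_start - context_frames)
    _, state = toy_recurrent_encoder(stream[ctx_start:seg_start])
    outputs, _ = toy_recurrent_encoder(stream[seg_start:seg_end], state=state)
    return outputs


# Hypothetical stream: four "loud" frames followed by a two-frame labeled segment.
stream = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
no_context = encode_with_context(stream, seg_start=4, seg_end=6, context_frames=0)
with_context = encode_with_context(stream, seg_start=4, seg_end=6, context_frames=2)
```

In both calls the output covers only the two labeled frames, but the values differ because the encoder state in the second call was warmed up on the preceding audio. In a real system the carried state would be the recurrent state of the encoder network rather than a single number.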
