Improving RNN-T ASR Accuracy Using Context Audio
Andreas Schwarz, Ilya Sklyar, Simon Wiesler

TL;DR
This paper introduces a training scheme for RNN-T ASR that lets the encoder exploit context audio from a stream, reducing word error rate by more than 6% in a realistic production setting and helping the model adapt to challenging acoustic environments.
Contribution
The paper proposes a training approach that uses segmented or partially labeled sequences of an audio stream so that the RNN-T encoder learns to exploit context audio, improving accuracy and robustness in streaming speech recognition.
Findings
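More than 6% WER reduction in a realistic production setting for a voice assistant ASR system
Indications of both speaker and environment adaptation on acoustically challenging data containing background speech
RNN-T loss gradients are visualized to probe how an LSTM-based encoder exploits long-term context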
Abstract
We present a training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) which allows the encoder network to learn to exploit context audio from a stream, using segmented or partially labeled sequences of the stream during training. We show that the use of context audio during training and inference can lead to word error rate reductions of more than 6% in a realistic production setting for a voice assistant ASR system. We investigate the effect of the proposed training approach on acoustically challenging data containing background speech and present data points which indicate that this approach helps the network learn both speaker and environment adaptation. To gain further insight into the ability of a long short-term memory (LSTM) based ASR encoder to exploit long-term context, we also visualize RNN-T loss gradients with…
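As a rough illustration of the idea in the abstract, the following Python sketch shows one way context audio could be fed to a streaming encoder while only the labeled segment contributes to the loss. It is not the authors' implementation: the names `slice_with_context`, `toy_encoder`, and `encode_segment`, the `max_context` parameter, and the exponential-average recurrence standing in for an LSTM are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def slice_with_context(
    stream: Sequence[float], seg_start: int, seg_end: int, max_context: int
) -> Tuple[List[float], int]:
    """Return the encoder input for one labeled segment of a stream.

    The input is up to `max_context` frames preceding the segment followed by
    the segment itself. The second return value is the number of leading
    frames that are context only (hypothetical helper, not from the paper).
    """
    if not (0 <= seg_start < seg_end <= len(stream)):
        raise ValueError("segment bounds must lie inside the stream")
    if max_context < 0:
        raise ValueError("max_context must be non-negative")
    ctx_start = max(0, seg_start - max_context)
    return list(stream[ctx_start:seg_end]), seg_start - ctx_start


def toy_encoder(frames: Sequence[float], decay: float = 0.5) -> List[float]:
    """Stand-in for a unidirectional recurrent encoder (assumption).

    A single running state is carried from frame to frame, so earlier audio
    influences later outputs, as it would in an LSTM encoder.
    """
    state = 0.0
    outputs = []
    for x in frames:
        state = decay * state + (1.0 - decay) * x
        outputs.append(state)
    return outputs


def encode_segment(
    stream: Sequence[float], seg_start: int, seg_end: int, max_context: int
) -> List[float]:
    """Encode context + segment, then keep only the segment's outputs.

    In this sketch, only the returned outputs would be passed on to the
    transducer loss; the context frames serve to warm up the encoder state.
    """
    encoder_input, n_context = slice_with_context(
        stream, seg_start, seg_end, max_context
    )
    outputs = toy_encoder(encoder_input)
    return outputs[n_context:]
```

With `stream = [1.0, 1.0, 0.0, 0.0]` and a labeled segment covering frames 2 to 4, allowing two context frames changes the encoder outputs for the segment compared with `max_context=0`, because the recurrent state has already seen the preceding audio when the segment begins.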
