Recognizing long-form speech using streaming end-to-end models
Arun Narayanan, Rohit Prabhavalkar, Chung-Cheng Chiu, David Rybach, Tara N. Sainath, Trevor Strohman

TL;DR
This paper investigates the limitations of end-to-end speech recognition models in handling long-form speech and proposes data diversity and LSTM state manipulation techniques to improve their generalization to longer audio segments.
Contribution
The paper proposes two complementary ways to help E2E ASR models trained on short utterances recognize long-form speech: training on diverse acoustic data, and manipulating LSTM states during training to simulate long-form audio.
Findings
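Data diversity improves WER by 90% relative on a synthesized long-form test set.
Simulating long-form audio during training improves WER by 67% relative on the same synthesized set; combining the two does not improve over data diversity alone.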
On a real long-form call-center test set, data diversity improves WER by 40% relative; adding long-form simulation on top of data diversity yields a further 27% relative improvement.
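The findings above are stated as relative WER improvements. As a reminder of what that means, here is a minimal sketch of the arithmetic; the function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs, chosen only to illustrate the arithmetic (not the paper's numbers).
print(relative_wer_reduction(50.0, 5.0))   # a drop from 50% to 5% WER is 90% relative
print(relative_wer_reduction(30.0, 18.0))  # a drop from 30% to 18% WER is 40% relative
```

A relative figure says nothing about the absolute error rate: a 90% relative improvement can still leave a high WER if the baseline was very poor.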
Abstract
All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40%…
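The abstract does not spell out how LSTM states are manipulated. One plausible reading, sketched below under that assumption, is that the final LSTM state of one short utterance is carried into the next instead of being reset to zero, so the network sees states resembling those that arise deep into a long recording. The scalar cell, the weight layout, and the names `lstm_step`, `run_utterance`, and `simulate_long_form` are all illustrative and are not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = w[name]
        return activation(w_x * x + w_h * h + bias)

    i = gate("input", sigmoid)    # input gate
    f = gate("forget", sigmoid)   # forget gate
    o = gate("output", sigmoid)   # output gate
    g = gate("cell", math.tanh)   # candidate cell value
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c


def run_utterance(frames, state, w):
    """Run the cell over one utterance, starting from the given (h, c) state."""
    h, c = state
    for x in frames:
        h, c = lstm_step(x, h, c, w)
    return (h, c)


def simulate_long_form(utterances, w, carry_state=True):
    """Process short utterances in order and return the final state after each.

    carry_state=True  : each utterance starts from the previous final state (assumed simulation).
    carry_state=False : each utterance starts from zeros (ordinary short-utterance training).
    """
    state = (0.0, 0.0)
    finals = []
    for frames in utterances:
        if not carry_state:
            state = (0.0, 0.0)
        state = run_utterance(frames, state, w)
        finals.append(state)
    return finals


# Hypothetical weights and inputs, for illustration only.
weights = {
    "input": (0.5, -0.3, 0.1),
    "forget": (0.2, 0.4, 1.0),
    "output": (-0.6, 0.3, 0.0),
    "cell": (0.8, -0.5, 0.2),
}
utterances = [[0.1, -0.2, 0.3], [0.5, 0.4]]
print(simulate_long_form(utterances, weights, carry_state=True))
print(simulate_long_form(utterances, weights, carry_state=False))
```

With `carry_state=True`, the state after the second utterance matches what the cell would produce on the two utterances concatenated, which is the sense in which short utterances stand in for one long one in this sketch.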
Taxonomy
Methods: Test · Sigmoid Activation · Tanh Activation · Long Short-Term Memory
