TL;DR
This paper introduces a two-pass end-to-end speech recognition system that combines streaming RNN-T and LAS models to improve accuracy while maintaining low latency suitable for real-time applications.
Contribution
It proposes a two-pass architecture in which a streaming RNN-T first pass is followed by a LAS second-pass component, yielding a 17-22% relative WER reduction over RNN-T alone.
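One way a second pass can refine a first pass is by rescoring its candidate hypotheses. The sketch below illustrates that general idea only. It is not the paper's implementation: the function names, the linear interpolation of scores, the `weight` parameter, and the toy scores are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# (transcript text, first-pass log-probability); hypothetical representation
Hypothesis = Tuple[str, float]


def rescore(
    first_pass: List[Hypothesis],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Pick the hypothesis with the best interpolated score.

    `weight` balances the two passes; 0.0 trusts only the first pass,
    1.0 trusts only the second pass. Linear interpolation is an assumption.
    """
    if not first_pass:
        raise ValueError("first_pass must contain at least one hypothesis")

    def combined(hyp: Hypothesis) -> float:
        text, first_score = hyp
        return (1.0 - weight) * first_score + weight * second_pass_score(text)

    return max(first_pass, key=combined)[0]


def toy_second_pass(text: str) -> float:
    """Stand-in for a second-pass model; returns made-up log-probabilities."""
    scores = {"recognize speech": -0.2, "wreck a nice beach": -3.0}
    return scores.get(text, -10.0)


# Made-up first-pass n-best list, best first-pass score listed first.
nbest = [("wreck a nice beach", -1.0), ("recognize speech", -1.2)]
best = rescore(nbest, toy_second_pass)
```

In this toy case the first pass alone prefers the wrong transcript, and the second-pass score changes the ranking. A real system would replace `toy_second_pass` with a model that scores each hypothesis against the audio.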
Findings
Achieves 17-22% relative WER reduction over RNN-T alone
Maintains low latency suitable for streaming applications
Demonstrates improved recognition accuracy in real-time settings
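For readers unfamiliar with the metric, a relative WER reduction is the drop in WER divided by the baseline WER. The sketch below shows the arithmetic. The WER values in it are hypothetical and chosen only to make the calculation easy to follow; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER improved to 8.0% WER.
reduction = relative_wer_reduction(10.0, 8.0)
print(f"{reduction:.0%} relative reduction")
```

Note that the same absolute improvement gives a larger relative reduction when the baseline WER is lower, so relative figures are best read alongside the absolute WERs.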
Abstract
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a…
