Two-Pass End-to-End Speech Recognition

Tara N. Sainath; Ruoming Pang; David Rybach; Yanzhang He; Rohit Prabhavalkar; Wei Li; Mirkó Visontai; Qiao Liang; Trevor Strohman; Yonghui Wu; Ian McGraw; Chung-Cheng Chiu

arXiv:1908.10992·cs.CL·August 30, 2019

TL;DR

This paper introduces a two-pass end-to-end speech recognition system that combines streaming RNN-T and LAS models to improve accuracy while maintaining low latency suitable for real-time applications.

Contribution

The paper proposes a two-pass architecture that improves streaming speech recognition quality by adding a LAS model as a second-pass component, reducing WER by 17-22% relative to RNN-T alone.
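To make the second-pass idea concrete, here is a minimal sketch of n-best rescoring: a streaming first pass proposes several hypotheses, and a second model re-scores them before the final pick. This page does not say how the two models' scores are combined, so the linear interpolation, the `weight` parameter, the `Hypothesis` class, and the toy log-probabilities below are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    first_pass_logp: float  # log-probability from the streaming first pass (toy value)


def rescore(
    hyps: List[Hypothesis],
    second_pass_logp: Callable[[str], float],
    weight: float = 0.5,
) -> List[Hypothesis]:
    """Sort hypotheses by an interpolated first-pass / second-pass score.

    The linear interpolation is an assumption for this sketch; the page
    does not specify how the two scores are combined.
    """
    def combined(h: Hypothesis) -> float:
        return (1.0 - weight) * h.first_pass_logp + weight * second_pass_logp(h.text)

    return sorted(hyps, key=combined, reverse=True)


# Toy stand-in for a second-pass model: a lookup table of made-up scores.
toy_second_pass = {"recognize speech": -1.0, "wreck a nice beach": -6.0}

hyps = [
    Hypothesis("wreck a nice beach", -2.0),  # first pass prefers this one
    Hypothesis("recognize speech", -2.5),
]

best = rescore(hyps, lambda text: toy_second_pass[text])[0]
first_pass_only = rescore(hyps, lambda text: toy_second_pass[text], weight=0.0)[0]
```

With `weight=0.0` the ranking falls back to the first pass alone, which makes it easy to see what the second pass changes in this toy case.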

Findings

1. Achieves 17-22% relative WER reduction over RNN-T alone
2. Maintains low latency suitable for streaming applications
3. Demonstrates improved recognition accuracy in real-time settings
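A relative WER reduction is measured against the baseline's own error rate, not in absolute percentage points. The short sketch below shows the arithmetic; the function name and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only: a baseline at 10.0% WER
# and a second system at 8.0% WER.
reduction = relative_wer_reduction(10.0, 8.0)
print(f"{reduction:.0%} relative reduction")
```

In this made-up case the absolute drop is 2 percentage points, while the relative reduction is 2 / 10 of the baseline error.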

Abstract

The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a…

Code & Models

Repositories

fd873630/RNN-Transducer (PyTorch)
