Streaming End-to-end Speech Recognition For Mobile Devices

Yanzhang He; Tara N. Sainath; Rohit Prabhavalkar; Ian McGraw; Raziel Alvarez; Ding Zhao; David Rybach; Anjuli Kannan; Yonghui Wu; Ruoming Pang; Qiao Liang; Deepti Bhatia; Yuan Shangguan; Bo Li; Golan Pundak; Khe Chai Sim; Tom Bagby; Shuo-yiin Chang; Kanishka Rao; Alexander Gruenstein

arXiv:1811.06621 · cs.CL · November 19, 2018 · 23 cites

TL;DR

This paper presents a streaming end-to-end speech recognizer built on a recurrent neural network transducer and designed for real-time, on-device use. In the authors' evaluations it outperforms a conventional CTC-based model in both latency and accuracy in a number of evaluation categories.

Contribution

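The work describes a streaming E2E speech recognizer based on a recurrent neural network transducer. It targets the challenges named in the abstract: real-time streaming decoding, robustness to the long tail of use cases, and use of user-specific context such as contact lists.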

Findings

01. Outperforms a conventional CTC-based model in both latency and accuracy in a number of evaluation categories

02. Effective for real-time, on-device speech recognition

03. Handles diverse use cases with improved robustness

Abstract

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing