Streaming End-to-end Speech Recognition For Mobile Devices
Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-yiin Chang, Kanishka Rao

TL;DR
This paper presents a streaming end-to-end speech recognizer built on a recurrent neural network transducer and intended for real-time, on-device use. In the authors' evaluations, it can outperform a conventional CTC-based model in both latency and accuracy in a number of evaluation categories.
Contribution
The work describes a streaming E2E speech recognizer based on a recurrent neural network transducer (RNN-T), targeting real-time streaming decoding, robustness to the long tail of use cases, and the use of user-specific context such as contact lists.
Findings
Can outperform a conventional CTC-based model in both latency and accuracy in a number of evaluation categories
Effective for real-time on-device speech recognition
Handles diverse use cases with improved robustness
Abstract
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
