Personalized Speech Recognition on Mobile Devices
Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, Carolina Parada

TL;DR
This paper presents a highly efficient, accurate, and low-latency speech recognition system optimized for mobile devices, combining quantized LSTM models, SVD compression, and adaptive language modeling.
Contribution
It introduces an on-device speech recognition system that combines a quantized LSTM acoustic model trained with CTC, SVD-based compression, and a single Bayesian-interpolated language model shared by the dictation and voice command domains.
Findings
Achieves a 13.5% word error rate (WER) on a dictation task
Runs seven times faster than real-time on a Nexus 5
Uses a single language model for both the dictation and voice command domains, biased on-the-fly with device-specific vocabulary
Abstract
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an…
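The two footprint reductions named in the abstract, SVD-based compression and quantization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`svd_compress`, `quantize_uint8`, `dequantize`), the matrix size, the chosen rank, and the simple min/max uniform 8-bit scheme are all assumptions made for illustration, and the paper's actual procedure may differ.

```python
import numpy as np

def svd_compress(weights, rank):
    """Approximate an (m x n) matrix by two factors with inner dimension `rank`.

    Hypothetical helper: truncated SVD as a generic low-rank compression step.
    """
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # shape (m, rank), singular values folded in
    right = vt[:rank, :]            # shape (rank, n)
    return left, right

def quantize_uint8(values):
    """Uniform linear quantization of a float array to 8-bit integers (assumed scheme)."""
    lo, hi = float(values.min()), float(values.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((values - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Map 8-bit codes back to approximate float values."""
    return q.astype(np.float32) * scale + lo

# Toy weight matrix standing in for one layer's parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48)).astype(np.float32)

left, right = svd_compress(w, rank=16)   # 64*16 + 16*48 values instead of 64*48
q, scale, lo = quantize_uint8(left)      # one byte per value instead of four
restored = dequantize(q, scale, lo)
```

In this sketch the low-rank factors replace the original matrix, trading some approximation error for fewer parameters, and quantization then stores each remaining value in one byte. How much accuracy is lost depends on the rank and the quantization scheme, which here are arbitrary choices.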
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
