Personalized Speech Recognition on Mobile Devices

Ian McGraw; Rohit Prabhavalkar; Raziel Alvarez; Montse Gonzalez Arenas; Kanishka Rao; David Rybach; Ouais Alsharif; Hasim Sak; Alexander Gruenstein; Francoise Beaufays; Carolina Parada

arXiv:1603.03185 · cs.CL · March 15, 2016 · 22 cites

Open Access

TL;DR

This paper presents an accurate, low-latency speech recognition system that runs faster than real-time on a mobile device, combining a quantized LSTM acoustic model, SVD-based compression, and a single language model that is biased on the fly with device-specific vocabulary.

Contribution

It introduces a mobile speech recognition system built from a quantized LSTM acoustic model trained with CTC, SVD-based compression, and a single Bayesian-interpolated language model shared by the dictation and voice command domains.

Findings

01 Achieves a 13.5% word error rate (WER) on a dictation task

02 Runs seven times faster than real-time on a Nexus 5

03 Uses a single language model for both the dictation and voice command domains
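The two headline numbers above are standard metrics, and a small sketch may help in reading them. The Python below is only an illustration, not the authors' evaluation code: the function names `word_error_rate` and `real_time_factor` are hypothetical, and the example utterances are invented. WER is the word-level edit distance divided by the number of reference words, and "seven times faster than real-time" corresponds to a real-time factor of roughly 1/7.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real-time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Invented example: one substitution plus one deletion over four words.
    print(word_error_rate("call john smith now", "call jon smith"))
    # One second of compute for seven seconds of audio.
    print(real_time_factor(1.0, 7.0))
```

A lower value is better for both metrics. Note that WER can exceed 1.0 when the hypothesis contains many insertions.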

Abstract

We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an…
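To make the idea of a quantized acoustic model concrete, here is a minimal sketch of uniform 8-bit quantization of a weight vector. It is an assumption-laden illustration: the excerpt above does not specify the paper's quantization scheme, so the min/max linear mapping and the helper names `quantize_8bit` and `dequantize` are hypothetical choices made for this example only.

```python
def quantize_8bit(weights):
    """Map floats linearly onto integer codes 0..255 (assumed scheme)."""
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when every weight is identical.
    scale = (w_max - w_min) / 255 or 1.0
    codes = [min(255, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Recover approximate float weights from the 8-bit codes."""
    return [c * scale + w_min for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # invented example values
    codes, scale, w_min = quantize_8bit(weights)
    restored = dequantize(codes, scale, w_min)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(codes, worst)
```

Storing one byte per weight instead of a 32-bit float cuts weight storage by a factor of four, at the cost of a rounding error of at most half a quantization step per weight.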

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
