A Language Agnostic Multilingual Streaming On-Device ASR System

Bo Li; Tara N. Sainath; Ruoming Pang; Shuo-yiin Chang; Qiumin Xu; Trevor Strohman; Vince Chen; Qiao Liang; Heguang Liu; Yanzhang He; Parisa Haghani; Sameer Bidichandani

arXiv:2208.13916·eess.AS·August 31, 2022

Open Access

TL;DR

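This paper presents a fully on-device, streaming multilingual end-to-end speech recognition system that supports intersentential code switching, with quality and latency comparable to individual monolingual models. It relies on an Encoder Endpointer model, an End-of-Utterance (EOU) Joint Layer, and an Embedding decoder that replaces the LSTM decoder.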

Contribution

It introduces a language-agnostic streaming E2E ASR system whose new model components and on-device optimizations enable real-time multilingual speech recognition on a mobile device.

Findings

01 Ran on a mobile device in less than real time, that is, processing took less time than the audio lasts.

02 Supported intersentential code switching in real time.

03 Replaced the time-consuming LSTM decoder with an Embedding decoder for efficiency.
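"Less than real time" is commonly expressed as a real-time factor (RTF) below 1, where RTF is processing time divided by audio duration. The sketch below only illustrates that ratio; the function names and the sample numbers are assumptions for illustration and are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def runs_in_less_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when the recognizer finishes faster than the audio plays (RTF < 1)."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical numbers: 2 s of compute for a 4 s utterance.
rtf = real_time_factor(2.0, 4.0)
```

An RTF below 1 is necessary for streaming use, since otherwise the recognizer falls further behind the speaker as the utterance goes on.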

Abstract

On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device with comparable quality and latency to individual monolingual models. To achieve that, we propose an Encoder Endpointer model and an End-of-Utterance (EOU) Joint Layer for a better quality and latency trade-off. Our system is built in a language agnostic manner allowing it to natively support intersentential code switching in real time. To address the feasibility concerns on large models, we conducted on-device profiling and replaced the time consuming LSTM decoder with the recently developed…
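To give a rough sense of why an embedding-based decoder can be cheaper than an LSTM decoder, here is a minimal sketch of a stateless prediction network that conditions only on the last few labels through table lookups. This is an illustration under stated assumptions, not the paper's architecture: the class name `EmbeddingDecoderSketch`, the context length, and the choice to average the embeddings are all hypothetical.

```python
import random


class EmbeddingDecoderSketch:
    """Hypothetical stateless decoder: averages embeddings of the last `context` labels.

    There is no recurrent state, so each step costs a few table lookups
    instead of an LSTM cell update.
    """

    def __init__(self, vocab_size: int, dim: int, context: int = 2, seed: int = 0):
        if context < 1:
            raise ValueError("context must be at least 1")
        rng = random.Random(seed)
        self.dim = dim
        self.context = context
        # Randomly initialized embedding table (assumption: trained in practice).
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)
        ]

    def __call__(self, label_history: list[int]) -> list[float]:
        recent = label_history[-self.context:]
        if not recent:
            # No labels emitted yet: return a zero vector.
            return [0.0] * self.dim
        summed = [0.0] * self.dim
        for label in recent:
            for i, value in enumerate(self.table[label]):
                summed[i] += value
        return [value / len(recent) for value in summed]


decoder = EmbeddingDecoderSketch(vocab_size=10, dim=4, context=2)
output = decoder([1, 2, 3])
```

Because the output depends only on the last `context` labels, histories `[1, 2, 3]` and `[7, 2, 3]` produce the same vector in this sketch.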

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory