Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

Yuan Shangguan; Rohit Prabhavalkar; Hang Su; Jay Mahadeokar; Yangyang Shi; Jiatong Zhou; Chunyang Wu; Duc Le; Ozlem Kalinli; Christian Fuegen; Michael L. Seltzer

arXiv:2104.02207 · cs.SD · August 13, 2021

TL;DR

This paper investigates how model architecture, training criteria, decoding hyperparameters, and endpointer parameters affect user-perceived latency (UPL) in on-device end-to-end speech recognition, and finds that conventional measures of model size and computation do not always reflect it.

Contribution

The paper analyzes techniques that affect user-perceived latency and emphasizes that token emission timing and endpointing matter more than conventional size and computation metrics.

Findings

01. Model size and FLOPS are not always strongly correlated with user-perceived latency.

02. Token emission timing and endpointing behavior significantly influence latency.

03. Joint ASR and endpointing, combined with alignment regularization, offers the best latency-WER trade-offs.
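The gap between computational metrics and perceived latency in the findings above can be illustrated with a minimal sketch. It is not the paper's definition of UPL: the `Utterance` fields, the `finalization_delay` proxy, and all timing values are hypothetical and chosen only to show how a system with a lower real-time factor (RTF) can still finalize its result later if its endpointer fires late.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical per-utterance timings, in seconds from the start of audio."""
    audio_duration_s: float   # length of the audio that was processed
    compute_time_s: float     # total time the model spent processing it
    speech_end_s: float       # when the user actually stopped speaking
    last_token_emit_s: float  # when the recognizer emitted its final token
    endpoint_s: float         # when the endpointer declared end of utterance


def real_time_factor(u: Utterance) -> float:
    """RTF: compute time divided by audio duration (below 1 keeps up with real time)."""
    return u.compute_time_s / u.audio_duration_s


def finalization_delay(u: Utterance) -> float:
    """Illustrative latency proxy (an assumption, not the paper's metric).

    The result is treated as final only once the last token has been emitted
    and the endpointer has fired, so the later of the two events determines
    how long the user waits after they stop speaking.
    """
    return max(u.last_token_emit_s, u.endpoint_s) - u.speech_end_s


# System A: cheap to run, but its endpointer waits a long time before closing.
system_a = Utterance(audio_duration_s=3.0, compute_time_s=0.6,
                     speech_end_s=3.0, last_token_emit_s=3.1, endpoint_s=3.9)

# System B: more expensive to run, but it endpoints soon after speech ends.
system_b = Utterance(audio_duration_s=3.0, compute_time_s=1.5,
                     speech_end_s=3.0, last_token_emit_s=3.2, endpoint_s=3.4)

for name, u in (("A", system_a), ("B", system_b)):
    print(f"system {name}: RTF={real_time_factor(u):.2f}, "
          f"delay after speech end={finalization_delay(u):.2f}s")
```

With these made-up numbers, system A has the lower RTF while system B has the shorter delay after the user stops speaking, which is the kind of mismatch the findings describe.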

Abstract

As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate and compact, such systems need to decode speech with low user-perceived latency (UPL), producing words as soon as they are spoken. This work examines the impact of various techniques - model architectures, training criteria, decoding hyperparameters, and endpointer parameters - on UPL. Our analyses suggest that measures of model size (parameters, input chunk sizes), or measures of computation (e.g., FLOPS, RTF) that reflect the model's ability to process input frames are not always strongly…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing