Dissecting User-Perceived Latency of On-Device E2E Speech Recognition
Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

TL;DR
This paper investigates how factors such as model architecture and endpointing affect user-perceived latency (UPL) in on-device end-to-end speech recognition systems, and shows that traditional measures of model size and computation may not accurately reflect the latency users actually experience.
Contribution
The paper analyzes how model architectures, training criteria, decoding hyperparameters, and endpointer parameters affect latency, and emphasizes that token emission timing and endpointing matter more than conventional computational metrics.
Findings
Model size and FLOPS are not always strongly correlated with user-perceived latency.
Token emission timing and endpointing significantly influence latency.
Joint ASR and endpointing with alignment regularization offers the best latency-WER trade-off among the approaches studied.
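The distinction these findings draw, between how fast a model computes and how long a user waits, can be illustrated with a small sketch. This is a toy model and not the paper's measurement methodology: the `Utterance` fields, the definition of `perceived_latency` (the result is treated as final once both the last token has been emitted and the endpointer has fired), and all numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical per-utterance timings, all in seconds."""
    audio_sec: float        # total duration of the audio
    speech_end_sec: float   # time at which the user stops speaking
    compute_sec: float      # total time the model spent computing
    final_token_sec: float  # time at which the last token was emitted
    endpoint_sec: float     # time at which the endpointer fired


def real_time_factor(u: Utterance) -> float:
    # Computation-centric measure: compute time per second of audio.
    return u.compute_sec / u.audio_sec


def emission_delay(u: Utterance) -> float:
    # How long after the end of speech the last token appears.
    return u.final_token_sec - u.speech_end_sec


def endpoint_latency(u: Utterance) -> float:
    # How long after the end of speech the endpointer closes the utterance.
    return u.endpoint_sec - u.speech_end_sec


def perceived_latency(u: Utterance) -> float:
    # Assumption: the user sees a final result only when both the last
    # token has been emitted and the endpointer has fired.
    return max(u.final_token_sec, u.endpoint_sec) - u.speech_end_sec


# Made-up numbers: "light" computes quickly but emits and endpoints late;
# "heavy" computes more slowly but emits and endpoints early.
light = Utterance(audio_sec=4.0, speech_end_sec=3.0, compute_sec=0.8,
                  final_token_sec=3.4, endpoint_sec=3.9)
heavy = Utterance(audio_sec=4.0, speech_end_sec=3.0, compute_sec=2.0,
                  final_token_sec=3.1, endpoint_sec=3.3)

for name, u in [("light", light), ("heavy", heavy)]:
    print(f"{name}: RTF={real_time_factor(u):.2f}, "
          f"emission delay={emission_delay(u):.2f}s, "
          f"endpoint latency={endpoint_latency(u):.2f}s, "
          f"perceived latency={perceived_latency(u):.2f}s")
```

In this toy setup the model with the lower real-time factor still yields the higher perceived latency, because its token emission and endpointing lag further behind the end of speech.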
Abstract
As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate and compact, such systems need to decode speech with low user-perceived latency (UPL), producing words as soon as they are spoken. This work examines the impact of various techniques - model architectures, training criteria, decoding hyperparameters, and endpointer parameters - on UPL. Our analyses suggest that measures of model size (parameters, input chunk sizes), or measures of computation (e.g., FLOPS, RTF) that reflect the model's ability to process input frames are not always strongly…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
