Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer
Yuan Shangguan, Kate Knister, Yanzhang He, Ian McGraw, Francoise Beaufays

TL;DR
This paper analyzes the quality and stability of on-device streaming end-to-end speech recognition models. It introduces metrics that quantify instability at the word and segment levels and explores ways to mitigate that instability.
Contribution
It introduces novel metrics for quantifying instability and explores techniques to mitigate instability in streaming E2E ASR models.
Findings
New metrics effectively measure instability at word and segment levels.
Certain training techniques improve accuracy but can increase instability.
Proposed solutions help reduce instability in streaming ASR systems.
Abstract
The demand for fast and accurate incremental speech recognition increases as the applications of automatic speech recognition (ASR) proliferate. Incremental speech recognizers output chunks of partially recognized words while the user is still talking. Partial results can be revised before the ASR finalizes its hypothesis, causing instability issues. We analyze the quality and stability of on-device streaming end-to-end (E2E) ASR models. We first introduce a novel set of metrics that quantify the instability at word and segment levels. We study the impact of several model training techniques that improve E2E model qualities but degrade model stability. We categorize the causes of instability and explore various solutions to mitigate them in a streaming E2E ASR system.
Index Terms: ASR, stability, end-to-end, text normalization, on-device, RNN-T
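To make the idea of instability concrete, here is a minimal Python sketch of one way revisions in partial results could be counted. It is an illustration only: the abstract does not give the paper's metric definitions, so the function names (`count_unstable_words`, `unstable_word_ratio`, `unstable_segment_ratio`) and the prefix-comparison rule used here are assumptions, not the authors' method.

```python
from typing import List, Tuple


def count_unstable_words(hypotheses: List[str]) -> int:
    """Count words that were shown in one hypothesis and then changed or
    dropped in the next one (assumed definition, for illustration)."""
    unstable = 0
    prev: List[str] = []
    for hyp in hypotheses:
        cur = hyp.split()
        # Length of the word prefix shared with the previous hypothesis.
        common = 0
        for a, b in zip(prev, cur):
            if a != b:
                break
            common += 1
        # Every previously shown word past the shared prefix was revised.
        unstable += len(prev) - common
        prev = cur
    return unstable


def unstable_word_ratio(partials: List[str], final: str) -> float:
    """Word-level view: revised words divided by words in the final result."""
    n_final = len(final.split())
    if n_final == 0:
        return 0.0
    return count_unstable_words(partials + [final]) / n_final


def unstable_segment_ratio(segments: List[Tuple[List[str], str]]) -> float:
    """Segment-level view: fraction of segments with at least one revision."""
    if not segments:
        return 0.0
    revised = sum(
        1 for partials, final in segments
        if count_unstable_words(partials + [final]) > 0
    )
    return revised / len(segments)


if __name__ == "__main__":
    partials = ["call", "call mom", "call mum and"]
    final = "call mum and dad"
    print(unstable_word_ratio(partials, final))
    print(unstable_segment_ratio([(partials, final), (["hello"], "hello there")]))
```

In this hypothetical example, "mom" is displayed and later replaced by "mum", so one word counts as unstable. A segment whose partials only ever grow by appending words counts as fully stable.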
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
