Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer

Yuan Shangguan; Kate Knister; Yanzhang He; Ian McGraw; Francoise Beaufays

arXiv:2006.01416·cs.CL·August 18, 2020

TL;DR

This paper evaluates the quality and stability of on-device streaming end-to-end speech recognition models, introducing new metrics and solutions to improve stability without compromising recognition accuracy.

Contribution

The paper introduces novel metrics for quantifying instability and explores techniques to mitigate it in streaming E2E ASR models.

Findings

1. New metrics effectively measure instability at word and segment levels.

2. Certain training techniques improve accuracy but can increase instability.

3. Proposed solutions help reduce instability in streaming ASR systems.

Abstract

The demand for fast and accurate incremental speech recognition increases as the applications of automatic speech recognition (ASR) proliferate. Incremental speech recognizers output chunks of partially recognized words while the user is still talking. Partial results can be revised before the ASR finalizes its hypothesis, causing instability issues. We analyze the quality and stability of on-device streaming end-to-end (E2E) ASR models. We first introduce a novel set of metrics that quantify the instability at word and segment levels. We study the impact of several model training techniques that improve E2E model qualities but degrade model stability. We categorize the causes of instability and explore various solutions to mitigate them in a streaming E2E ASR system.

Index Terms: ASR, stability, end-to-end, text normalization, on-device, RNN-T

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing