FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang

TL;DR
FastEmit is a sequence-level emission regularization technique for streaming ASR that reduces emission latency while improving accuracy. It requires no alignment data and applies across various transducer models.
Contribution
The paper presents FastEmit, a novel sequence-level emission regularization method that lowers streaming ASR latency and improves accuracy without requiring alignment information.
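To make "sequence-level emission regularization" concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes the regularizer amounts to scaling the gradient of every label-emission transition in the transducer lattice by (1 + lam) while leaving blank gradients unchanged; that detail is not stated in this summary. The names `transducer_grads` and `toy_log_probs`, the toy 3-way softmax, and `lam=0.01` are all hypothetical choices for illustration.

```python
import math
import random

NEG_INF = -math.inf

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def transducer_grads(log_blank, log_label, lam=0.0):
    """Forward-backward over a T x (U+1) transducer lattice.

    log_blank[t][u]: log-prob of blank at node (t, u)       (T rows, U+1 cols)
    log_label[t][u]: log-prob of the next label at (t, u)   (T rows, U cols)
    Returns (nll, grad_blank, grad_label): the unregularized negative
    log-likelihood and its gradients w.r.t. the log-probs. Label gradients
    are scaled by (1 + lam), which is the ASSUMED regularization here.
    """
    T, U = len(log_blank), len(log_blank[0]) - 1
    # alpha[t][u]: log-prob of reaching node (t, u) from (0, 0)
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1])
    # beta[t][u]: log-prob of finishing the sequence from node (t, u)
    beta = [[NEG_INF] * (U + 1) for _ in range(T)]
    beta[T - 1][U] = log_blank[T - 1][U]
    for t in reversed(range(T)):
        for u in reversed(range(U + 1)):
            if t < T - 1:
                beta[t][u] = logaddexp(beta[t][u], log_blank[t][u] + beta[t + 1][u])
            if u < U:
                beta[t][u] = logaddexp(beta[t][u], log_label[t][u] + beta[t][u + 1])
    log_p = beta[0][0]
    grad_blank = [[0.0] * (U + 1) for _ in range(T)]
    grad_label = [[0.0] * U for _ in range(T)]
    for t in range(T):
        for u in range(U + 1):
            # In the last frame only the final blank at (T-1, U) lies on a valid path.
            nxt = beta[t + 1][u] if t < T - 1 else (0.0 if u == U else NEG_INF)
            grad_blank[t][u] = -math.exp(alpha[t][u] + log_blank[t][u] + nxt - log_p)
            if u < U:
                occupancy = math.exp(alpha[t][u] + log_label[t][u] + beta[t][u + 1] - log_p)
                grad_label[t][u] = -(1.0 + lam) * occupancy  # assumed emission boost
    return -log_p, grad_blank, grad_label

def toy_log_probs(T, U, seed=0):
    """Random per-node log-probs from a 3-way softmax: blank, next label, other."""
    rng = random.Random(seed)
    log_blank, log_label = [], []
    for _ in range(T):
        row_b, row_l = [], []
        for u in range(U + 1):
            z = [rng.gauss(0.0, 1.0) for _ in range(3)]
            lse = math.log(sum(math.exp(v) for v in z))
            row_b.append(z[0] - lse)
            if u < U:
                row_l.append(z[1] - lse)
        log_blank.append(row_b)
        log_label.append(row_l)
    return log_blank, log_label

log_blank, log_label = toy_log_probs(T=4, U=3)
nll, gb, gl = transducer_grads(log_blank, log_label, lam=0.0)
_, gb_fe, gl_fe = transducer_grads(log_blank, log_label, lam=0.01)
```

Under this assumption the only change relative to the plain transducer gradients is the uniform (1 + lam) factor on label transitions, so no per-token or per-frame alignment is needed: every path through the lattice is nudged toward emitting labels rather than blanks.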
Findings
Achieves 150-300 ms latency reduction on Voice Search.
Improves streaming WER from 4.4%/8.9% to 3.1%/7.5% on LibriSpeech.
Reduces 90th percentile latency from 210 ms to 30 ms on LibriSpeech.
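The figures above can be turned into relative improvements with a few lines of arithmetic. The sketch below uses only the numbers listed in the findings; the helper `relative_reduction` and the dictionary keys are hypothetical names chosen for illustration.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after` (0.25 means 25% lower)."""
    return (before - after) / before

# (before, after) pairs taken from the findings listed above.
reported = {
    "wer_first_pct": (4.4, 3.1),
    "wer_second_pct": (8.9, 7.5),
    "p90_latency_ms": (210.0, 30.0),
}

relative = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}
absolute_latency_gain_ms = reported["p90_latency_ms"][0] - reported["p90_latency_ms"][1]

for name, value in relative.items():
    print(f"{name}: {value:.1%} relative reduction")
print(f"p90 latency drops by {absolute_latency_gain_ms:.0f} ms")
```

In relative terms the two WER figures fall by roughly 30% and 16%, and the 90th percentile latency falls by 180 ms, or about 86%.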
Abstract
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models. While being successful in reducing delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models, and does not require any alignment. We demonstrate that FastEmit is more suitable to the…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
