FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

Jiahui Yu; Chung-Cheng Chiu; Bo Li; Shuo-yiin Chang; Tara N. Sainath; Yanzhang He; Arun Narayanan; Wei Han; Anmol Gulati; Yonghui Wu; Ruoming Pang

arXiv:2010.11148·eess.AS·February 5, 2021·6 cites

TL;DR

FastEmit is a sequence-level emission regularization technique for streaming ASR. It reduces emission latency while improving accuracy, requires no alignment data, and applies to a range of transducer models.

Contribution

The paper presents FastEmit, a sequence-level emission regularization method that lowers the latency of streaming ASR and improves its accuracy without requiring alignment information.

Findings

01

Achieves 150-300 ms latency reduction on Voice Search.

02

Lowers WER from 4.4%/8.9% to 3.1%/7.5% on LibriSpeech.

03

Reduces 90th percentile latency from 210 ms to 30 ms on LibriSpeech.
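
To put the reported numbers on a common scale, the short sketch below converts each before/after pair into an absolute and a relative reduction. The helper name `relative_reduction` is hypothetical and only the figures quoted above are used; it is plain arithmetic, not part of the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    return 100.0 * (before - after) / before


# Figures quoted in the findings above: (before, after).
reported = {
    "WER, first figure (%)": (4.4, 3.1),
    "WER, second figure (%)": (8.9, 7.5),
    "90th percentile latency (ms)": (210.0, 30.0),
}

for name, (before, after) in reported.items():
    absolute = before - after
    relative = relative_reduction(before, after)
    print(f"{name}: {before} -> {after} "
          f"(absolute {absolute:.1f}, relative {relative:.1f}%)")
```

Read this way, the two WER figures drop by roughly 30% and 16% relative, and the 90th percentile latency drops by 180 ms.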

Abstract

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models. While being successful in reducing delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models, and does not require any alignment. We demonstrate that FastEmit is more suitable to the…
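
The abstract describes FastEmit as a regularizer on the per-sequence probability that needs no alignments. A minimal sketch of how such a term can act inside transducer training follows. It assumes (this reading is not stated in the text above) that the regularizer amounts to scaling the gradient on label, i.e. non-blank, emissions by (1 + λ) while leaving blank gradients unchanged. The function `transducer_grads`, the parameter `lam`, and the toy lattice are hypothetical; this is not the authors' implementation.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def transducer_grads(log_blank, log_label, lam=0.0):
    """Forward-backward over a T x (U+1) transducer lattice.

    log_blank[t][u]: log-prob of emitting blank at node (t, u).
    log_label[t][u]: log-prob of emitting target label u+1 at node (t, u).
    Returns (loss, grad_blank, grad_label), gradients taken with respect to
    the log-probs. `lam` is the assumed emission-regularization weight; the
    returned loss is the plain negative log-likelihood.
    """
    T, U = len(log_blank), len(log_blank[0]) - 1

    # alpha[t][u]: log-prob of reaching (t, u) from (0, 0).
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            terms = []
            if t > 0:
                terms.append(alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:
                terms.append(alpha[t][u - 1] + log_label[t][u - 1])
            if terms:
                alpha[t][u] = logsumexp(terms)

    # beta[t][u]: log-prob of finishing the sequence from (t, u).
    beta = [[NEG_INF] * (U + 1) for _ in range(T)]
    beta[T - 1][U] = log_blank[T - 1][U]  # final blank ends the sequence
    for t in reversed(range(T)):
        for u in reversed(range(U + 1)):
            terms = []
            if t < T - 1:
                terms.append(log_blank[t][u] + beta[t + 1][u])
            if u < U:
                terms.append(log_label[t][u] + beta[t][u + 1])
            if terms:
                beta[t][u] = logsumexp(terms)

    log_p = beta[0][0]
    grad_blank = [[0.0] * (U + 1) for _ in range(T)]
    grad_label = [[0.0] * U for _ in range(T)]
    for t in range(T):
        for u in range(U + 1):
            # Log-prob of completing the path after taking the blank arc.
            if t < T - 1:
                after = beta[t + 1][u]
            else:
                after = 0.0 if u == U else NEG_INF
            grad_blank[t][u] = -math.exp(
                alpha[t][u] + log_blank[t][u] + after - log_p)
            if u < U:
                # Assumed FastEmit step: boost label-emission gradients only.
                grad_label[t][u] = -(1.0 + lam) * math.exp(
                    alpha[t][u] + log_label[t][u] + beta[t][u + 1] - log_p)
    return -log_p, grad_blank, grad_label


# Toy lattice: 3 frames, 2 target labels, every arc has probability 0.5.
T, U = 3, 2
log_blank = [[math.log(0.5)] * (U + 1) for _ in range(T)]
log_label = [[math.log(0.5)] * U for _ in range(T)]
loss, g_blank, g_label = transducer_grads(log_blank, log_label, lam=0.0)
_, g_blank_fe, g_label_fe = transducer_grads(log_blank, log_label, lam=0.01)
```

Under this assumption the extra term only reweights posteriors that the forward-backward pass already computes over the whole lattice, which is consistent with the claim that no word alignment from an existing model is needed.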

Code & Models

Repositories

victor45664/espnet
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing