StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR

Hirofumi Inaguma; Tatsuya Kawahara

arXiv:2107.00635 · eess.AS · July 16, 2021

TL;DR

StableEmit is an alignment-free regularization technique for streaming monotonic attention ASR models. It encourages earlier token emission, reducing both emission latency and recognition errors.

Contribution

The paper introduces a simple, alignment-free regularization method that encourages monotonic chunkwise attention (MoChA) models to emit tokens earlier during streaming ASR.

Findings

01

Significantly reduces recognition errors and emission latency.

02

Effective with both LSTM and Conformer encoders.

03

Complementary to alignment-based methods for further improvements.

Abstract

While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control the timing to emit tokens during training. In this work, we propose a simple alignment-free regularization method, StableEmit, to encourage MoChA to emit tokens earlier. StableEmit discounts the selection probabilities in hard monotonic attention for token boundary detection by a constant factor and regularizes them to recover the total attention mass during training. As a result, the scale of the selection probabilities is increased, and the values can reach a threshold for token emission…
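The mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the standard expected-alignment recurrence for hard monotonic attention, assumes the discount is applied as a multiplication by `(1 - discount)`, and assumes the regularizer is the absolute gap between the number of output tokens and the total attention mass. The names `expected_alignment`, `discounted_mass_penalty`, and `discount` are hypothetical.

```python
import numpy as np


def expected_alignment(p):
    """Expected alignment of hard monotonic attention (training-time view).

    p: array of shape (U, T) with selection probabilities in [0, 1],
       U output tokens and T encoder frames.
    Returns alpha of shape (U, T); row i is the probability that token i
    is emitted at each frame.
    """
    num_tokens, num_frames = p.shape
    alpha = np.zeros((num_tokens, num_frames))
    # Before the first token, all attention mass sits on frame 0.
    prev = np.zeros(num_frames)
    prev[0] = 1.0
    for i in range(num_tokens):
        q = 0.0  # probability that frame j is inspected for token i
        for j in range(num_frames):
            carried = q * (1.0 - p[i, j - 1]) if j > 0 else 0.0
            q = carried + prev[j]
            alpha[i, j] = p[i, j] * q
        prev = alpha[i]
    return alpha


def discounted_mass_penalty(p, discount):
    """Assumed regularizer: |U - total attention mass| after discounting.

    The selection probabilities are scaled by (1 - discount), which is an
    assumption about how the constant factor is applied.
    """
    alpha = expected_alignment((1.0 - discount) * p)
    return abs(p.shape[0] - alpha.sum())


# Toy input: 2 tokens, 4 frames, every selection probability equal to 0.5.
p = np.full((2, 4), 0.5)
penalty_plain = discounted_mass_penalty(p, discount=0.0)
penalty_discounted = discounted_mass_penalty(p, discount=0.5)
```

In this toy setting the discounted probabilities lose attention mass, so the penalty is larger than without the discount. Minimizing such a penalty during training would push the underlying selection probabilities upward, which matches the abstract's statement that their scale is increased.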

Taxonomy

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory