StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR
Hirofumi Inaguma, Tatsuya Kawahara

TL;DR
StableEmit is a regularization technique for streaming ASR models based on monotonic attention. It reduces token emission latency without relying on alignment information, and it lowers recognition errors as well.
Contribution
It introduces a simple, alignment-free regularization method that encourages earlier token emission in monotonic attention models, enhancing streaming ASR performance.
Findings
Significantly reduces recognition errors and emission latency.
Effective with both LSTM and Conformer encoders.
Complementary to alignment-based methods for further improvements.
Abstract
While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control the timing to emit tokens during training. In this work, we propose a simple alignment-free regularization method, StableEmit, to encourage MoChA to emit tokens earlier. StableEmit discounts the selection probabilities in hard monotonic attention for token boundary detection by a constant factor and regularizes them to recover the total attention mass during training. As a result, the scale of the selection probabilities is increased, and the values can reach a threshold for token emission…
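The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `discount` parameter, and the absolute-difference form of the penalty are hypothetical choices made here for clarity. The expected-alignment recurrence is the standard one for hard monotonic attention, where `p_choose[i, j]` is the probability that output token `i` stops at encoder frame `j`.

```python
import numpy as np


def expected_alignment(p_choose):
    """Expected monotonic alignment alpha[i, j] from selection probabilities.

    p_choose has shape (num_tokens, num_frames), with values in [0, 1].
    """
    p_choose = np.asarray(p_choose, dtype=float)
    num_tokens, num_frames = p_choose.shape
    alpha = np.zeros((num_tokens, num_frames))
    prev = np.zeros(num_frames)
    prev[0] = 1.0  # before the first token, all mass sits on frame 0
    for i in range(num_tokens):
        q = 0.0
        for j in range(num_frames):
            # Mass that reached frame j-1 for this token and did not stop there.
            carry = q * (1.0 - p_choose[i, j - 1]) if j > 0 else 0.0
            # Plus mass from the previous token having stopped at frame j.
            q = carry + prev[j]
            alpha[i, j] = p_choose[i, j] * q
        prev = alpha[i]
    return alpha


def discounted_alignment_and_penalty(p_choose, discount=0.1):
    """Discount p_choose by a constant factor and measure the lost attention mass.

    Assumption: the regularizer is |num_tokens - sum(alpha)|; the abstract only
    says the total attention mass is "recovered", not how it is measured.
    """
    p_choose = np.asarray(p_choose, dtype=float)
    p_discounted = (1.0 - discount) * p_choose  # constant-factor discount
    alpha = expected_alignment(p_discounted)
    penalty = abs(p_choose.shape[0] - alpha.sum())
    return alpha, penalty
```

In this sketch, minimizing the penalty during training can only be done by raising the undiscounted `p_choose` values, which is consistent with the abstract's statement that the scale of the selection probabilities is increased.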
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
