Mutually-Constrained Monotonic Multihead Attention for Online ASR
Jaeyun Song, Hajin Shim, Eunho Yang

TL;DR
This paper introduces a novel training method for Monotonic Multihead Attention in online ASR that considers interactions across heads to reduce latency and improve performance during inference.
Contribution
It proposes a training approach that incorporates head interactions in MMA, aligning training and testing phases to enhance real-time ASR performance.
Findings
Improved ASR accuracy over baseline models
Reduced test latency in online decoding
Effective on standard benchmark datasets
Abstract
Despite supporting real-time decoding, Monotonic Multihead Attention (MMA) shows performance comparable to state-of-the-art offline methods in machine translation and automatic speech recognition (ASR) tasks. However, the latency of MMA is still a major issue in ASR, so it should be combined with a technique that can reduce the latency at inference time, such as head-synchronous beam search decoding, which forces all non-activated heads to activate after a small fixed delay from the first head activation. In this paper, we remove the discrepancy between the training and test phases by considering, in the training of MMA, the interactions across multiple heads that will occur at test time. Specifically, we derive the expected alignments from monotonic attention by considering the boundaries of other heads and reflect them in the learning process. We validate our proposed method…
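The head-synchronous rule mentioned in the abstract (non-activated heads are forced to activate a small fixed delay after the first head activation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `synchronize_heads`, the parameters `max_delay` and `num_frames`, and the fallback to the last frame when no head fires are all assumptions made for illustration. It also works on already-computed activation frames, whereas a real streaming decoder would apply the rule frame by frame.

```python
from typing import List, Optional


def synchronize_heads(
    activations: List[Optional[int]],
    max_delay: int,
    num_frames: int,
) -> List[int]:
    """Apply a head-synchronous constraint to one decoding step.

    activations[h] is the encoder frame at which head h would activate on
    its own, or None if it never activates within the input (hypothetical
    representation chosen for this sketch).
    """
    fired = [a for a in activations if a is not None]
    if not fired:
        # Assumption: if no head fires, every head attends at the last frame.
        return [num_frames - 1] * len(activations)

    # The earliest head sets the reference point for all other heads.
    first = min(fired)
    # Heads may lag the first activation by at most max_delay frames.
    deadline = min(first + max_delay, num_frames - 1)

    # Heads that are late or silent are forced to activate at the deadline.
    return [
        deadline if a is None or a > deadline else a
        for a in activations
    ]


if __name__ == "__main__":
    # Head 0 fires at frame 3; heads 1 and 2 are forced to frame 3 + 4 = 7.
    print(synchronize_heads([3, None, 10, 5], max_delay=4, num_frames=20))
```

During standard MMA training no such forcing is applied, which is the train/test discrepancy the abstract says the proposed method removes by accounting for the boundaries of other heads when computing expected alignments.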
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
