Mutually-Constrained Monotonic Multihead Attention for Online ASR

Jaeyun Song; Hajin Shim; Eunho Yang

arXiv:2103.14302 · cs.CL · March 29, 2021

TL;DR

This paper introduces a novel training method for Monotonic Multihead Attention in online ASR that considers interactions across heads to reduce latency and improve performance during inference.

Contribution

It proposes a training approach that incorporates head interactions in MMA, aligning training and testing phases to enhance real-time ASR performance.

Findings

01 Improved ASR accuracy over baseline models

02 Reduced test latency in online decoding

03 Effective on standard benchmark datasets

Abstract

Despite the feature of real-time decoding, Monotonic Multihead Attention (MMA) shows comparable performance to the state-of-the-art offline methods in machine translation and automatic speech recognition (ASR) tasks. However, the latency of MMA is still a major issue in ASR and should be combined with a technique that can reduce the test latency at inference time, such as head-synchronous beam search decoding, which forces all non-activated heads to activate after a small fixed delay from the first head activation. In this paper, we remove the discrepancy between training and test phases by considering, in the training of MMA, the interactions across multiple heads that will occur in the test time. Specifically, we derive the expected alignments from monotonic attention by considering the boundaries of other heads and reflect them in the learning process. We validate our proposed method…
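
The head-synchronous rule mentioned in the abstract (every head that has not yet activated is forced to activate a small fixed delay after the first head does) can be illustrated with a toy sketch. This is not the authors' implementation and it omits beam search entirely. The function name `synchronize_heads`, its arguments, and the use of `None` for a head that never activates are all assumptions made for illustration.

```python
def synchronize_heads(boundaries, delay):
    """Clamp per-head activation frames for a single output token.

    boundaries: frame index at which each head would activate on its own,
                or None if the head never activates (hypothetical encoding).
    delay:      maximum number of frames a head may lag behind the first
                head to activate.
    """
    activated = [b for b in boundaries if b is not None]
    if not activated:
        # No head fired, so there is no first activation to synchronize to.
        return list(boundaries)
    limit = min(activated) + delay
    # Heads that are late or silent are forced to activate at the limit.
    return [limit if b is None or b > limit else b for b in boundaries]


print(synchronize_heads([10, 12, 30, None], delay=4))  # [10, 12, 14, 14]
```

In this example the first head activates at frame 10, so with a delay of 4 the late head (frame 30) and the silent head are both moved to frame 14. The abstract's point is that training which ignores this clamping sees different head boundaries than inference does.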

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing