Monotonic Multihead Attention
Xutai Ma, Juan Pino, James Cross, Liezl Puzon, Jiatao Gu

TL;DR
This paper introduces Monotonic Multihead Attention (MMA), a novel attention mechanism for simultaneous machine translation that improves latency-quality tradeoffs by extending monotonic attention to multiple heads with interpretable latency controls.
Contribution
The paper proposes MMA, which extends monotonic attention to multihead attention, together with two interpretable latency control methods designed for multiple attention heads.
Findings
MMA achieves better latency-quality tradeoffs than MILk, the previous state-of-the-art approach.
Latency controls influence attention span and translation quality.
Analysis of decoder layers and heads shows their impact on performance.
Abstract
Simultaneous machine translation models start generating a target sequence before they have encoded or read the source sequence. Recent approaches for this task either apply a fixed policy on a state-of-the-art Transformer model, or a learnable monotonic attention on a weaker recurrent neural network-based structure. In this paper, we propose a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention. We also introduce two novel and interpretable approaches for latency control that are specifically designed for multiple attention heads. We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. We also analyze how the latency controls affect the attention span and we motivate the introduction of our model…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
