On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

Tamer Alkhouli, Gabriel Bretschner, and Hermann Ney

arXiv:1809.03985·cs.CL·September 12, 2018

TL;DR

This paper adds an alignment head to the multi-head attention of transformer-based neural machine translation to obtain sharper attention weights, which improves translation quality in dictionary-guided tasks, and it proposes alignment pruning to speed up decoding.

Contribution

It introduces an additional alignment head to improve attention clarity and proposes alignment pruning to accelerate decoding without performance loss.

Findings

01

Up to 3.8% BLEU improvement with dictionary guidance, compared to 2.4% BLEU in the baseline case.

02

Alignment pruning speeds up decoding by 1.8x.

03

Effective on WMT 2016 and BOLT datasets.

Abstract

This work investigates the alignment problem in state-of-the-art multi-head attention models based on the transformer architecture. We demonstrate that alignment extraction in transformer models can be improved by augmenting an additional alignment head to the multi-head source-to-target attention component. This is used to compute sharper attention weights. We describe how to use the alignment head to achieve competitive performance. To study the effect of adding the alignment head, we simulate a dictionary-guided translation task, where the user wants to guide translation using pre-defined dictionary entries. Using the proposed approach, we achieve up to 3.8% BLEU improvement when using the dictionary, in comparison to 2.4% BLEU in the baseline case. We also propose alignment pruning to speed up decoding in alignment-based neural machine translation (ANMT), which speeds up…
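The idea of extracting alignments from an attention head and using them for dictionary guidance can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names (`extract_alignment`, `apply_dictionary`), the toy scores, and the post-hoc word replacement are hypothetical and are not taken from the paper, whose procedure operates inside the decoder and may differ in detail. The sketch only shows why sharp attention weights matter: the clearer the peak in each row, the more reliable the argmax alignment that decides where a dictionary entry applies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_alignment(attention_rows):
    """For each target position, return the source position with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention_rows]


def apply_dictionary(source_tokens, hypothesis_tokens, alignment, dictionary):
    """Replace a target word when its aligned source word has a dictionary entry.

    This post-hoc replacement is a simplification made for this sketch.
    """
    guided = []
    for tgt_pos, tgt_word in enumerate(hypothesis_tokens):
        src_word = source_tokens[alignment[tgt_pos]]
        guided.append(dictionary.get(src_word, tgt_word))
    return guided


source = ["das", "Haus", "ist", "klein"]
hypothesis = ["the", "house", "is", "small"]

# Hypothetical unnormalized scores of a single alignment head:
# one row per target word, one column per source word.
scores = [
    [4.0, 0.5, 0.1, 0.1],
    [0.3, 5.0, 0.2, 0.1],
    [0.1, 0.2, 4.5, 0.3],
    [0.1, 0.1, 0.4, 5.0],
]
attention = [softmax(row) for row in scores]

alignment = extract_alignment(attention)
guided = apply_dictionary(source, hypothesis, alignment, {"Haus": "building"})

print(alignment)  # [0, 1, 2, 3]
print(guided)     # ['the', 'building', 'is', 'small']
```

With flatter attention rows the argmax becomes unstable, and a dictionary entry could be applied at the wrong target position.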

Taxonomy

Methods: Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing