Joint Source-Target Self Attention with Locality Constraints
José A. R. Fonollosa, Noe Casas, Marta R. Costa-jussà

TL;DR
This paper introduces a decoder-only transformer architecture with locality-constrained self-attention for neural machine translation. It reports a new state of the art on IWSLT'14 German-English and matches the best reported results on WMT'14 English-German and English-French.
Contribution
It proposes a decoder-only architecture that applies locality constraints to the self-attention receptive field over the joint source-target sequence, departing from the standard encoder-decoder structure.
Findings
Achieved a new state of the art of 35.7 BLEU on IWSLT'14 German-English
Matched the best reported results on WMT'14 English-German and WMT'14 English-French
Simplified architecture with competitive translation performance
Abstract
The dominant neural machine translation models are based on the encoder-decoder structure, and many of them rely on an unconstrained receptive field over source and target sequences. In this paper we study a new architecture that breaks with both conventions. Our simplified architecture consists in the decoder part of a transformer model, based on self-attention, but with locality constraints applied on the attention receptive field. As input for training, both source and target sentences are fed to the network, which is trained as a language model. At inference time, the target tokens are predicted autoregressively starting with the source sequence as previous tokens. The proposed model achieves a new state of the art of 35.7 BLEU on IWSLT'14 German-English and matches the best reported results in the literature on the WMT'14 English-German and WMT'14 English-French translation…
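The two ideas in the abstract, feeding source and target as one sequence and restricting how far back each position may attend, can be sketched in a few lines. This is a minimal illustration only: the `<sep>` token, the function names, and the single fixed causal window are assumptions made for the example, and the paper's actual masking scheme and window sizes are not given in this summary.

```python
def joint_sequence(source, target, sep="<sep>"):
    """Concatenate source and target tokens into one language-model input.

    The separator token is a hypothetical choice for this sketch.
    """
    return list(source) + [sep] + list(target)


def local_causal_mask(length, window):
    """Return mask[i][j], True when position i may attend to position j.

    Assumed constraint: position i sees itself and at most
    `window - 1` positions immediately before it, never later ones.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [[i - window < j <= i for j in range(length)] for i in range(length)]


tokens = joint_sequence(["ich", "bin", "hier"], ["i", "am", "here"])
mask = local_causal_mask(len(tokens), window=3)

# Print the mask as a grid: 'x' marks an allowed attention link.
for tok, row in zip(tokens, mask):
    print(f"{tok:>6} " + "".join("x" if allowed else "." for allowed in row))
```

With a window shorter than the sequence, distant source tokens are not directly visible from late target positions in a single layer, so information has to propagate through stacked layers. At inference time the same mask would be applied while target tokens are appended one by one after the source prefix.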
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
