Weighted Transformer Network for Machine Translation

Karim Ahmed; Nitish Shirish Keskar; Richard Socher

arXiv:1711.02132 · cs.AI · November 8, 2017 · 134 cites

TL;DR

The paper introduces Weighted Transformer, an improved self-attention model for machine translation that converges faster and achieves higher BLEU scores than the baseline Transformer architecture.

Contribution

It proposes a novel weighted attention mechanism with multiple self-attention branches that enhances performance and training efficiency in neural machine translation.

Findings

01

Improves BLEU over the baseline Transformer by 0.5 and 0.4 points on two WMT translation tasks.

02

Converges 15-40% faster during training.

03

Achieves state-of-the-art results on translation benchmarks.

Abstract

State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recurrence. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention by multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT…
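The abstract describes replacing multi-head attention with several self-attention branches whose outputs the model learns to combine, but this excerpt does not give the exact formulation. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes each branch is a scaled dot-product self-attention with its own projections, and that the combination weights come from a softmax over one trainable scalar per branch. The softmax parameterization, the function names (`branched_attention`, `random_matrix`), and all dimensions are illustrative assumptions. It uses plain Python to stay dependency-free.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention for a single branch."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def branched_attention(x, branches, branch_logits):
    """Weighted sum of per-branch self-attention outputs.

    branches: list of (w_q, w_k, w_v, w_o) projection matrices, one per branch.
    branch_logits: one trainable scalar per branch (assumed parameterization).
    """
    # Assumption: softmax keeps the weights positive and summing to 1.
    alphas = softmax(branch_logits)
    d_out = len(branches[0][3][0])
    out = [[0.0] * d_out for _ in x]
    for alpha, (w_q, w_k, w_v, w_o) in zip(alphas, branches):
        head = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        projected = matmul(head, w_o)  # map the branch back to model width
        for i, row in enumerate(projected):
            for j, val in enumerate(row):
                out[i][j] += alpha * val
    return out, alphas


rng = random.Random(0)
seq_len, d_model, d_head, n_branches = 4, 8, 2, 4
x = random_matrix(seq_len, d_model, rng)
branches = [
    (
        random_matrix(d_model, d_head, rng),
        random_matrix(d_model, d_head, rng),
        random_matrix(d_model, d_head, rng),
        random_matrix(d_head, d_model, rng),
    )
    for _ in range(n_branches)
]
branch_logits = [0.0] * n_branches  # equal branch weights before any training
y, alphas = branched_attention(x, branches, branch_logits)
```

In a trained model, `branch_logits` would be updated by gradient descent along with the projection matrices, so some branches could end up contributing more than others. Here they are left at zero, which weights every branch equally.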

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax