Weighted Transformer Network for Machine Translation
Karim Ahmed, Nitish Shirish Keskar, Richard Socher

TL;DR
The paper introduces Weighted Transformer, an improved self-attention model for machine translation that converges faster and achieves higher BLEU scores than the baseline Transformer architecture.
Contribution
It proposes a novel weighted attention mechanism with multiple self-attention branches that enhances performance and training efficiency in neural machine translation.
Findings
Outperforms the baseline Transformer by 0.5 BLEU points on the WMT 2014 English-to-German task and by 0.4 BLEU points on the English-to-French task.
Converges 15-40% faster during training.
Achieves state-of-the-art results on translation benchmarks.
Abstract
State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention by multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT…
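The abstract's central idea, several self-attention branches whose outputs the model learns to combine, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy dimensions, and the choice of a softmax to keep the branch weights positive and summing to one are all assumptions made for illustration, and the feed-forward sublayers and training loop are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def self_attention(x, wq, wk, wv):
    """One branch: scaled dot-product self-attention over the sequence x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    attn = [softmax(row) for row in scores]
    return matmul(attn, v)


def weighted_branches(x, branches, branch_logits):
    """Combine branch outputs as a weighted sum.

    Assumption: the weights come from a softmax over learnable logits,
    which keeps them positive and normalized. The paper may constrain
    its weights differently.
    """
    weights = softmax(branch_logits)
    outputs = [self_attention(x, *params) for params in branches]
    seq_len, d_out = len(outputs[0]), len(outputs[0][0])
    combined = [
        [sum(w * out[t][j] for w, out in zip(weights, outputs)) for j in range(d_out)]
        for t in range(seq_len)
    ]
    return combined, weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random matrix standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, n_branches = 4, 8, 3  # toy sizes, not the paper's settings
x = random_matrix(seq_len, d_model, rng)
branches = [
    tuple(random_matrix(d_model, d_model, rng) for _ in range(3))  # (wq, wk, wv)
    for _ in range(n_branches)
]
branch_logits = [0.0] * n_branches  # would be learned; zeros give equal weights
combined, weights = weighted_branches(x, branches, branch_logits)
```

In this sketch, training would adjust `branch_logits` together with the projection matrices, so branches that prove more useful receive a larger share of the combined output. With a very large logit on a single branch, the result reduces to ordinary single-branch self-attention.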
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
