Very Deep Transformers for Neural Machine Translation
Xiaodong Liu, Kevin Duh, Liyuan Liu, Jianfeng Gao

TL;DR
This paper shows that very deep Transformer models, with up to 60 encoder layers and 12 decoder layers, can be trained stably for neural machine translation and improve translation quality over standard 6-layer baselines by as much as 2.5 BLEU.
Contribution
It uses a simple yet effective initialization technique that stabilizes training of very deep Transformer models for NMT, reaching new state-of-the-art results on WMT14 benchmarks.
Findings
Deep Transformer models outperform baseline 6-layer models by up to 2.5 BLEU.
The deep models achieve new state-of-the-art results on WMT14 English-French (43.8 BLEU, and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU).
Training stability is improved with the proposed initialization method.
Abstract
We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers. These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14 English-French (43.8 BLEU and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU). The code and trained models will be publicly available at: https://github.com/namisan/exdeep-nmt.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Dropout · Label Smoothing · Residual Connection
