Very Deep Transformers for Neural Machine Translation

Xiaodong Liu; Kevin Duh; Liyuan Liu; Jianfeng Gao

arXiv:2008.07772 · cs.CL · October 16, 2020 · 71 citations

TL;DR

This paper demonstrates that very deep Transformer models, with up to 60 encoder layers and 12 decoder layers, can be trained effectively for neural machine translation, improving translation quality by as much as 2.5 BLEU over 6-layer baselines.

Contribution

It uses a simple initialization technique that stabilizes training, making it feasible to train extremely deep Transformer models for NMT and to reach state-of-the-art results on WMT14.

Findings

01 Deep Transformer models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.

02 The deep models achieve new state-of-the-art results on WMT14 English-French (43.8 BLEU, and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU).

03 A simple initialization technique stabilizes training of the deep models.

Abstract

We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers. These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14 English-French (43.8 BLEU and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU). The code and trained models will be publicly available at: https://github.com/namisan/exdeep-nmt.
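To give a rough sense of what 60 encoder layers and 12 decoder layers mean in terms of model size, the sketch below counts the parameters in the layer stacks of a standard Transformer. The widths `d_model=512` and `d_ff=2048` are assumptions borrowed from the original Transformer base configuration, not values stated in the abstract, and the helper names are hypothetical. Embeddings and the output projection are left out, so the totals are illustrative only.

```python
# Illustrative parameter count for standard Transformer layer stacks.
# ASSUMPTION: d_model=512 and d_ff=2048 (original Transformer "base" widths);
# the abstract does not state the widths used for the deep models.

def attention_params(d_model: int) -> int:
    # Query, key, value and output projections: each d_model x d_model plus bias.
    return 4 * (d_model * d_model + d_model)

def ffn_params(d_model: int, d_ff: int) -> int:
    # Position-wise feed-forward: d_model -> d_ff -> d_model, both with bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def layer_norm_params(d_model: int) -> int:
    # One gain and one bias vector per layer normalization.
    return 2 * d_model

def encoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + feed-forward, each followed by a layer norm.
    return (attention_params(d_model)
            + ffn_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))

def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + encoder-decoder attention + feed-forward, three layer norms.
    return (2 * attention_params(d_model)
            + ffn_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))

def stack_params(enc_layers: int, dec_layers: int,
                 d_model: int = 512, d_ff: int = 2048) -> int:
    # Total parameters in the encoder and decoder stacks (no embeddings).
    return (enc_layers * encoder_layer_params(d_model, d_ff)
            + dec_layers * decoder_layer_params(d_model, d_ff))

if __name__ == "__main__":
    baseline = stack_params(6, 6)
    deep = stack_params(60, 12)
    print(f"6 encoder / 6 decoder layers:   {baseline:,} parameters")
    print(f"60 encoder / 12 decoder layers: {deep:,} parameters")
    print(f"ratio: {deep / baseline:.2f}x")
```

Under these assumed widths, most of the added parameters sit in the encoder, because the encoder grows tenfold while the decoder only doubles.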

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Dropout · Label Smoothing · Residual Connection