Learning Deep Transformer Models for Machine Translation
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao

TL;DR
This paper shows that a deep Transformer (up to a 30-layer encoder), trained with proper use of layer normalization and with each layer receiving a combination of the previous layers' outputs, can outperform the wider Transformer-Big model in machine translation while being smaller and faster to train.
Contribution
The paper makes deep Transformer encoders trainable through two changes: proper use of layer normalization and a new way of passing a combination of previous layers to the next layer. The resulting deep model beats Transformer-Big in translation quality while being smaller and faster to train.
Findings
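Deep Transformer models (30/25-layer encoder) outperform the shallow Transformer-Big/Base baselines (6-layer encoder) by 0.4-2.4 BLEU points.
Compared with Transformer-Big, the deep models are 1.6 times smaller and 3 times faster to train.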
Proper normalization and layer passing are key to training very deep Transformers.
Abstract
Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As…
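The two ingredients named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that "proper use of layer normalization" means normalizing the input of each sublayer (pre-norm) rather than its output (post-norm), and that the "combination of previous layers" is a weighted sum of their normalized outputs. The function names, the toy sublayer, and the uniform weight initialization are all hypothetical; in a real model the weights would be learned.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_step(x, sublayer):
    """Post-norm residual unit: LN(x + F(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

def pre_norm_step(x, sublayer):
    """Pre-norm residual unit: x + F(LN(x)). The identity path is not normalized."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]

def combine_layers(outputs, weights):
    """Weighted sum of the normalized outputs of all previous layers."""
    combined = [0.0] * len(outputs[0])
    for w, y in zip(weights, outputs):
        for i, v in enumerate(layer_norm(y)):
            combined[i] += w * v
    return combined

def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * math.tanh(v) for v in x]

def uniform_weights(num_layers):
    """Assumed initialization: layer l averages its l + 1 predecessors."""
    return [[1.0 / (l + 1)] * (l + 1) for l in range(num_layers)]

def deep_encoder(x, num_layers, weights_per_layer):
    """Stack pre-norm layers; each layer's input combines all earlier outputs."""
    outputs = [x]  # outputs[0] plays the role of the embedding
    for l in range(num_layers):
        layer_input = combine_layers(outputs, weights_per_layer[l])
        outputs.append(pre_norm_step(layer_input, toy_sublayer))
    return outputs[-1]

if __name__ == "__main__":
    print(deep_encoder([1.0, 2.0, 3.0, 4.0], 30, uniform_weights(30)))
```

In the pre-norm unit the residual path carries `x` through unchanged, so a signal can pass from the top layer to the bottom without going through a normalization at every step. This is one common explanation for why pre-norm stacks are easier to train at depth.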
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
