Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
Biao Zhang, Ivan Titov, Rico Sennrich

TL;DR
This paper introduces depth-scaled initialization and merged attention to improve training and efficiency of deep Transformer models, significantly enhancing translation performance without increasing decoding time.
Contribution
It proposes novel initialization and attention merging techniques that enable deeper Transformers to train effectively and efficiently for machine translation.
Findings
Deep Transformers with proposed methods outperform baseline models in BLEU scores.
Depth-scaled initialization alleviates gradient vanishing issues.
Merged attention reduces computational cost while maintaining performance.
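To illustrate where the cost saving in the last finding can come from, here is a minimal sketch of an average-based replacement for decoder self-attention. It assumes that "average-based" means each position outputs the mean of all decoder inputs up to and including itself; the function name `causal_average` is hypothetical, and this is not presented as the authors' implementation.

```python
def causal_average(sequence):
    """Return, for each position, the mean of all vectors up to that position.

    `sequence` is a list of equal-length lists of floats (one per position).
    A running sum avoids the pairwise score matrix that dot-product
    self-attention would compute.
    """
    if not sequence:
        return []
    running = [0.0] * len(sequence[0])
    outputs = []
    for position, vector in enumerate(sequence, start=1):
        # Accumulate the sum of inputs seen so far.
        running = [r + v for r, v in zip(running, vector)]
        # Divide by the number of positions seen to get the causal mean.
        outputs.append([r / position for r in running])
    return outputs


if __name__ == "__main__":
    print(causal_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

Because the average has no learned attention weights, it needs one pass over the sequence rather than a score for every pair of positions, which is one plausible source of the reported efficiency.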
Abstract
The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT…
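The abstract describes DS-Init only as reducing parameter variance at initialization. As a rough sketch, the snippet below assumes the reduction is done by dividing a Xavier/Glorot uniform bound by the square root of the layer's 1-based depth. The function names, the `alpha` hyperparameter, and the exact scaling rule are assumptions for illustration and are not confirmed by the text above.

```python
import math
import random


def ds_init_bound(fan_in, fan_out, layer_depth, alpha=1.0):
    """Uniform sampling bound for a weight matrix at a given 1-based depth.

    Assumed rule: Xavier uniform bound scaled by alpha / sqrt(layer_depth),
    so deeper layers start with smaller parameter variance.
    """
    if layer_depth < 1:
        raise ValueError("layer_depth must be >= 1")
    xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
    return alpha * xavier_bound / math.sqrt(layer_depth)


def ds_init_matrix(fan_in, fan_out, layer_depth, alpha=1.0, seed=0):
    """Sample a fan_in x fan_out matrix uniformly within the scaled bound."""
    rng = random.Random(seed)
    bound = ds_init_bound(fan_in, fan_out, layer_depth, alpha)
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


if __name__ == "__main__":
    # The bound shrinks as depth grows: layer 4 gets half the bound of layer 1.
    for depth in (1, 4, 16):
        print(depth, ds_init_bound(512, 512, depth))
```

Under this assumed rule, the variance of a uniform distribution on [-b, b] is b²/3, so the parameter variance at depth l is 1/l of the unscaled value.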
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam
