Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Biao Zhang; Ivan Titov; Rico Sennrich

arXiv:1908.11365 · cs.CL · August 30, 2019 · 5 cites

TL;DR

This paper introduces depth-scaled initialization and merged attention to improve training and efficiency of deep Transformer models, significantly enhancing translation performance without increasing decoding time.

Contribution

It proposes novel initialization and attention merging techniques that enable deeper Transformers to train effectively and efficiently for machine translation.

Findings

01. Deep Transformers with proposed methods outperform baseline models in BLEU scores.
02. Depth-scaled initialization alleviates gradient vanishing issues.
03. Merged attention reduces computational cost while maintaining performance.
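
Depth-scaled initialization can be illustrated with a short sketch. It assumes the scheme is a Xavier-style uniform initialization whose bound is multiplied by alpha / sqrt(l) for a sublayer at depth l, so that parameter variance shrinks by alpha^2 / l as the network gets deeper. The function names `ds_init_bound` and `ds_init` and the default `alpha=1.0` are illustrative choices, not taken from the official repository.

```python
import math
import random


def ds_init_bound(fan_in, fan_out, depth, alpha=1.0):
    """Uniform bound: the Xavier bound scaled by alpha / sqrt(depth).

    Assumption: depth counts layers starting at 1, and alpha is a
    hyperparameter in (0, 1]. Both are illustrative conventions.
    """
    xavier = math.sqrt(6.0 / (fan_in + fan_out))
    return xavier * alpha / math.sqrt(depth)


def ds_init(fan_in, fan_out, depth, alpha=1.0, rng=None):
    """Sample a fan_in x fan_out weight matrix from U[-bound, bound]."""
    rng = rng or random.Random(0)
    bound = ds_init_bound(fan_in, fan_out, depth, alpha)
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


if __name__ == "__main__":
    # Deeper layers receive a tighter bound, hence smaller variance.
    for layer in (1, 4, 16):
        print(layer, ds_init_bound(512, 512, layer))
```

Because the variance of U[-b, b] is b^2 / 3, halving the bound (for example, moving from depth 1 to depth 4) quarters the parameter variance.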

Abstract

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT…
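
The merged attention idea from the abstract can be sketched in a few lines. This is one possible reading, not the authors' implementation: it assumes the average-based branch is a causal cumulative mean over decoder states, that the encoder-decoder branch is single-head scaled dot-product attention, and that the two branches are summed. Learned projections, multiple heads, the residual connection and layer normalization are all omitted, and every function name here is hypothetical.

```python
import math


def cumulative_average(values):
    """Position t receives the mean of values[0..t] (causal, no learned weights)."""
    out, running = [], [0.0] * len(values[0])
    for t, vec in enumerate(values, start=1):
        running = [r + x for r, x in zip(running, vec)]
        out.append([r / t for r in running])
    return out


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over encoder states."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def merged_attention(decoder_states, encoder_states):
    """Sum the average-based branch and the encoder-decoder branch (assumed merge)."""
    avg = cumulative_average(decoder_states)
    ctx = cross_attention(decoder_states, encoder_states, encoder_states)
    return [[a + c for a, c in zip(row_a, row_c)]
            for row_a, row_c in zip(avg, ctx)]


if __name__ == "__main__":
    dec = [[1.0, 0.0], [3.0, 0.0]]
    enc = [[0.5, 0.5]]
    print(merged_attention(dec, enc))
```

The cumulative mean can be updated incrementally at each decoding step from a running sum, which is why an average-based branch is cheaper than full self-attention over all previous positions.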

Code & Models

Repositories

bzhangGo/zero (TensorFlow, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam