Lipschitz Constrained Parameter Initialization for Deep Transformers
Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong, and Jingyi Zhang

TL;DR
This paper introduces a Lipschitz-based parameter initialization method that improves the training and convergence of deep Transformer models, enabling effective training of models with more than 12 layers and enhancing translation quality.
Contribution
The paper proposes a Lipschitz-constrained parameter initialization technique that ensures convergence of deep Transformers with the original computation order, with benefits for both encoders and decoders.
Findings
Deep Transformers with over 12 layers can be trained successfully.
Lipschitz initialization improves BLEU scores for deep models.
The original computation order can be used effectively with proper initialization.
Abstract
The Transformer translation model employs residual connection and layer normalization to ease the optimization difficulties caused by its multi-layer encoder/decoder structure. Previous research shows that even with residual connection and layer normalization, deep Transformers still have difficulty in training, and particularly Transformer models with more than 12 encoder/decoder layers fail to converge. In this paper, we first empirically demonstrate that a simple modification made in the official implementation, which changes the computation order of residual connection and layer normalization, can significantly ease the optimization of deep Transformers. We then compare the subtle differences in computation order in considerable detail, and present a parameter initialization method that leverages the Lipschitz constraint on the initialization of Transformer parameters that…
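The abstract refers to two computation orders for the residual connection and layer normalization without spelling them out. As a minimal sketch, assuming the commonly described arrangement (normalization applied after the residual addition in the original order, and to the sublayer input in the modified order), the difference can be illustrated as follows. The function names and `toy_sublayer` are hypothetical stand-ins, the layer normalization omits learned gain and bias, and nothing here reproduces the paper's Lipschitz-constrained initialization.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (near) unit variance; no learned gain/bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v for v in x]

def post_norm_block(x, sublayer):
    """Assumed original order: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """Assumed modified order: normalize the sublayer input, leave the residual path as is."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0]
post = post_norm_block(x, toy_sublayer)  # output is re-normalized at every block
pre = pre_norm_block(x, toy_sublayer)    # input passes through the block unnormalized
```

In the second variant the input `x` reaches the output through an identity path, which is the usual informal explanation for why that order is easier to optimize in deep stacks.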
Taxonomy
Topics: Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices · Topic Modeling
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
