DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei

TL;DR
This paper introduces DeepNorm, a normalization technique that stabilizes training of extremely deep Transformers, enabling models up to 1,000 layers and achieving superior performance on large-scale multilingual translation tasks.
Contribution
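The paper presents DeepNorm, a normalization function that modifies the Transformer residual connection and is paired with a theoretically derived initialization. Together they allow stable training of Transformers up to 1,000 layers, one order of magnitude deeper than previous deep Transformers.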
Findings
Successfully trained 1,000-layer Transformers.
A 200-layer model with 3.2B parameters outperforms the larger 48-layer state-of-the-art model on a multilingual benchmark with 7,482 translation directions.
DeepNorm combines advantages of Post-LN and Pre-LN normalization.
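To make the comparison in the last finding concrete, the three residual forms can be written side by side. This is a sketch in common notation rather than a quotation from the paper: G_l stands for the l-th sublayer (attention or feed-forward), LN for layer normalization, and α for a depth-dependent constant. The summary above does not define these symbols, so they are assumptions introduced here for illustration.

```latex
\begin{align}
\text{Post-LN:}\quad  & x_{l+1} = \mathrm{LN}\bigl(x_l + G_l(x_l)\bigr) \\
\text{Pre-LN:}\quad   & x_{l+1} = x_l + G_l\bigl(\mathrm{LN}(x_l)\bigr) \\
\text{DeepNorm:}\quad & x_{l+1} = \mathrm{LN}\bigl(\alpha\, x_l + G_l(x_l)\bigr), \quad \alpha > 1
\end{align}
```

Read this way, DeepNorm keeps the Post-LN placement of the normalization but up-weights the residual branch before normalizing.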
Abstract
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanying with theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model…
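The abstract's description of DeepNorm (a modified residual connection plus a depth-dependent initialization) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names are hypothetical, the layer norm omits the learnable gain and bias, and the α/β constants follow the formulas as recalled from the paper's main text, so they should be checked against the paper before use. The sublayer count assumes two sublayers per encoder layer and three per decoder layer, with an even 500/500 split for the 1,000-layer model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis (no learnable gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def deepnorm_constants(n_enc, n_dec):
    """Return {(part): (alpha, beta)} for the given layer counts.

    Assumed formulas (verify against the paper):
      encoder-only:    alpha = (2N)^(1/4),            beta = (8N)^(-1/4)
      decoder-only:    alpha = (2M)^(1/4),            beta = (8M)^(-1/4)
      encoder-decoder: encoder alpha = 0.81 (N^4 M)^(1/16),
                               beta  = 0.87 (N^4 M)^(-1/16)
                       decoder alpha = (3M)^(1/4),    beta = (12M)^(-1/4)
    alpha scales the residual branch; beta is the gain used when
    initializing the feed-forward, value, and output projections.
    """
    n, m = n_enc, n_dec
    if n <= 0 and m <= 0:
        raise ValueError("need at least one encoder or decoder layer")
    if n > 0 and m > 0:
        return {
            "encoder": (0.81 * (n**4 * m) ** (1 / 16),
                        0.87 * (n**4 * m) ** (-1 / 16)),
            "decoder": ((3 * m) ** 0.25, (12 * m) ** -0.25),
        }
    if n > 0:
        return {"encoder": ((2 * n) ** 0.25, (8 * n) ** -0.25)}
    return {"decoder": ((2 * m) ** 0.25, (8 * m) ** -0.25)}

def deepnorm_residual(x, sublayer, alpha):
    """DeepNorm residual: LN(alpha * x + sublayer(x))."""
    return layer_norm(alpha * x + sublayer(x))

def count_sublayers(n_enc, n_dec):
    """Encoder layer: self-attention + FFN; decoder layer adds cross-attention."""
    return 2 * n_enc + 3 * n_dec

# Usage: one DeepNorm step around a toy sublayer for an 8-layer encoder.
rng = np.random.default_rng(0)
alpha, beta = deepnorm_constants(8, 0)["encoder"]
x = rng.standard_normal((2, 16))
W = rng.standard_normal((16, 16)) * beta   # toy weights, scaled down by beta
y = deepnorm_residual(x, lambda h: np.tanh(h @ W), alpha)
total_sublayers = count_sublayers(500, 500)
```

Under the assumed 500/500 split, `count_sublayers(500, 500)` reproduces the 2,500 sublayers quoted in the abstract. The toy `tanh` sublayer only stands in for attention or a feed-forward network.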
Taxonomy
Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing
