DeepNet: Scaling Transformers to 1,000 Layers

Hongyu Wang; Shuming Ma; Li Dong; Shaohan Huang; Dongdong Zhang; Furu Wei

arXiv:2203.00555·cs.CL·March 2, 2022·54 cites

TL;DR

This paper introduces DeepNorm, a normalization function that stabilizes the training of extremely deep Transformers. It enables models of up to 1,000 layers, and a 200-layer model with 3.2B parameters outperforms the 48-layer state-of-the-art model on a large-scale multilingual translation benchmark.

Contribution

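The paper presents DeepNorm, a new normalization function paired with a theoretically derived initialization and supported by theoretical analysis. It allows stable training of Transformers with up to 1,000 layers, one order of magnitude deeper than previous deep Transformers.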

Findings

01. Successfully trained Transformers with up to 1,000 layers.

02. A 200-layer model with 3.2B parameters outperforms the larger 48-layer state-of-the-art model on multilingual translation.

03. DeepNorm combines the good performance of Post-LN with the stable training of Pre-LN.
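
To make the Post-LN / Pre-LN / DeepNorm comparison concrete, the following is a minimal sketch of the three residual arrangements on plain Python lists. It is an illustration, not the authors' implementation: the DeepNorm form (Post-LN with the residual branch scaled by a constant `alpha`) is written as recalled from the paper rather than from this page, and `layer_norm`, `toy_sublayer`, and the function names are hypothetical helpers with no learned gain or bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln(x, sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays unnormalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def deepnorm(x, sublayer, alpha):
    # DeepNorm (assumed form): Post-LN with the residual scaled by alpha.
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer(x))])

# Hypothetical stand-in for an attention or feed-forward sublayer.
toy_sublayer = lambda x: [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
out_post = post_ln(x, toy_sublayer)
out_pre = pre_ln(x, toy_sublayer)
out_deep_alpha1 = deepnorm(x, toy_sublayer, alpha=1.0)
```

With `alpha=1.0` the DeepNorm sketch reduces to Post-LN, which is one way to read the claim that it keeps the Post-LN structure while changing how strongly the residual path is weighted.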

Abstract

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model…
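
The abstract's two numeric ideas, the depth-dependent constants and the sublayer count, can be checked with a short sketch. Everything here is an assumption not stated on this page: the formulas `alpha = (2N)^(1/4)` and `beta = (8N)^(-1/4)` are recalled from the paper for a single encoder-only or decoder-only stack of `N` layers (the encoder-decoder setting uses different constants), `beta` is understood to scale the initialization of a subset of weights, and the 500/500 encoder/decoder split is assumed for the 1,000-layer model. The function names are hypothetical.

```python
def deepnorm_constants(num_layers):
    """Return (alpha, beta) for a single stack of num_layers layers.

    Assumed formulas: alpha scales the residual branch,
    beta scales the initialization of selected weights.
    """
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta

def count_sublayers(enc_layers, dec_layers):
    # Encoder layer: self-attention + feed-forward (2 sublayers).
    # Decoder layer: self-attention + cross-attention + feed-forward (3 sublayers).
    return 2 * enc_layers + 3 * dec_layers

# Small worked case: an 8-layer stack.
alpha, beta = deepnorm_constants(8)

# Assumed split of the 1,000-layer model: 500 encoder + 500 decoder layers.
total = count_sublayers(500, 500)
```

Under the assumed 500/500 split, `count_sublayers` gives 2 × 500 + 3 × 500 sublayers, which matches the 2,500 quoted in the abstract. Both constants move slowly with depth (a fourth root), so even very deep stacks get only a moderately larger residual weight.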

Taxonomy

Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing