B2T Connection: Serving Stability and Performance in Deep Transformers
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

TL;DR
This paper investigates the causes of training instability in deep Transformers with Post-LN and proposes a simple modification to improve stability and performance across various text generation tasks.
Contribution
The study provides empirical and theoretical insights into LN position effects and introduces a modification to Post-LN that enhances stability and effectiveness in deep Transformer training.
Findings
The LN in Post-LN is the main source of vanishing gradients, which makes training deep Transformers unstable.
Pre-LN prevents vanishing gradients, enabling stable training.
Modified Post-LN achieves both stability and high performance in experiments.
Abstract
From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), the training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training.…
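To make the two LN positions named in the abstract concrete, here is a minimal sketch in plain Python. It assumes a single token vector, an LN without learned gain or bias, and a hypothetical `sublayer` function (tanh of a linear map) standing in for attention or the feed-forward network. None of these names come from the paper, and the sketch does not implement the paper's proposed modification. It only contrasts where LN sits relative to the residual connection.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x, w):
    """Hypothetical stand-in for attention or feed-forward: tanh(W x)."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]


def post_ln_layer(x, w):
    # Post-LN: LN is applied after the residual addition, so the residual
    # path itself passes through LN in every layer.
    return layer_norm([a + b for a, b in zip(x, sublayer(x, w))])


def pre_ln_layer(x, w):
    # Pre-LN: LN is applied only to the sublayer input, so the residual
    # path carries x to the output without passing through LN.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x), w))]


random.seed(0)
d = 8  # illustrative hidden size (assumption)
x = [random.gauss(0, 1) for _ in range(d)]
w = [[random.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(d)]

post_out = post_ln_layer(x, w)
pre_out = pre_ln_layer(x, w)
```

In the Post-LN form every gradient flowing back through the layer must pass through LN, which is the component the abstract identifies as the main source of vanishing gradients. In the Pre-LN form the identity path from input to output bypasses LN.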
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Layer Normalization
