B2T Connection: Serving Stability and Performance in Deep Transformers

Sho Takase; Shun Kiyono; Sosuke Kobayashi; Jun Suzuki

arXiv:2206.00330 · cs.LG · May 29, 2023

TL;DR

This paper investigates the causes of training instability in deep Transformers with Post-LN and proposes a simple modification to improve stability and performance across various text generation tasks.

Contribution

The study provides empirical and theoretical insights into LN position effects and introduces a modification to Post-LN that enhances stability and effectiveness in deep Transformer training.

Findings

01. Post-LN causes vanishing gradients, leading to instability in deep Transformers.
02. Pre-LN prevents vanishing gradients, enabling stable training.
03. Modified Post-LN achieves both stability and high performance in experiments.

Abstract

From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), the training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training.…
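To make the two LN positions named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' code and does not implement the paper's proposed modification, which the truncated abstract above does not describe. It assumes the commonly used definitions, Post-LN as LN(x + F(x)) and Pre-LN as x + F(LN(x)), omits the learned gain and bias of LN, and uses a hypothetical `toy_sublayer` in place of attention or feed-forward.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize x to zero mean and (approximately) unit variance.

    Assumption: the learned gain and bias of a real LN are left out.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN: LN is applied after the residual addition, LN(x + F(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])


def pre_ln(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN: LN is applied to the sublayer input only, x + F(LN(x))."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or feed-forward: scales by 0.5."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln(x, toy_sublayer)  # residual stream is re-normalized
pre_out = pre_ln(x, toy_sublayer)    # residual stream passes through as-is
```

In the Post-LN form every layer's output, including the residual path, passes through LN, whereas in the Pre-LN form the input x reaches the output through an unnormalized identity path. This structural difference is the one the abstract's gradient observations refer to.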

Code & Models

Repositories

takase/b2t_connection (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Layer Normalization