Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh

TL;DR
This paper demonstrates that deep vanilla transformers can be trained effectively without skip connections or normalization by designing specific initialization and rescaling methods, addressing signal propagation challenges unique to transformers.
Contribution
The authors introduce novel initialization and rescaling techniques that enable training deep vanilla transformers without skip connections or normalization layers.
Findings
Deep vanilla transformers can be trained to match standard models' performance.
The proposed methods let deep transformers without normalization train at speeds matching their standard counterparts.
Deep vanilla transformers (no skip connections and no normalization) reach the same performance as standard transformers after about five times more iterations.
Abstract
Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve…
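To make the signal propagation problem concrete, here is a minimal toy sketch; it is not the authors' method. It stacks attention-only layers with no skips and no normalisation, and assumes the query-key logits are zero at initialisation, so the attention matrix is determined only by an additive bias. The function names, the diagonal bias value of 8.0, and the sizes (16 tokens, 32 dimensions, 50 layers) are all hypothetical choices made for illustration. Causal masking and positional encodings are ignored.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(num_tokens, diag_bias):
    """Attention weights under the toy assumption of zero query-key logits:
    each row is the softmax of the bias alone (diag_bias on the diagonal, 0 elsewhere)."""
    return [
        softmax([diag_bias if i == j else 0.0 for j in range(num_tokens)])
        for i in range(num_tokens)
    ]


def apply_attention(attn, tokens):
    """One attention-only layer without skip connection or normalisation: X <- A X."""
    dim = len(tokens[0])
    return [
        [sum(a * tok[k] for a, tok in zip(row, tokens)) for k in range(dim)]
        for row in attn
    ]


def mean_cosine(tokens):
    """Average cosine similarity over all distinct token pairs."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)


def run_stack(diag_bias, depth=50, num_tokens=16, dim=32, seed=0):
    """Push random Gaussian tokens through `depth` layers; return final mean cosine."""
    rng = random.Random(seed)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    attn = attention_matrix(num_tokens, diag_bias)
    for _ in range(depth):
        tokens = apply_attention(attn, tokens)
    return mean_cosine(tokens)


uniform_similarity = run_stack(diag_bias=0.0)  # uniform attention
biased_similarity = run_stack(diag_bias=8.0)   # near-identity attention
print(uniform_similarity, biased_similarity)
```

With zero bias the attention is uniform, so every token is replaced by the sequence mean and the tokens become indistinguishable. A large diagonal bias keeps each attention matrix close to the identity, so the tokens stay distinct through the stack. This only illustrates why the attention matrix at initialisation matters for deep skipless stacks; the paper's actual constructions are more involved.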
Taxonomy
Topics: Neural Networks and Applications
