Effective Theory of Transformers at Initialization
Emily Dinan, Sho Yaida, Susan Zhang

TL;DR
This paper analyzes how signals propagate forward and backward through wide, deep Transformers at initialization, derives suggested width scalings for initialization and training hyperparameters, and tests those suggestions by training Vision and Language Transformers in practical setups.
Contribution
It offers an effective-theory framework for understanding signal propagation in Transformers at initialization and proposes specific width scalings of hyperparameters, which the authors then apply when training Vision and Language Transformers.
Findings
Suggested width scalings for initialization and training hyperparameters
Improved training stability and performance in Vision and Language Transformers
Theoretical insights align with empirical results
Abstract
We perform an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, i.e., residual neural networks with multi-head self-attention blocks and multilayer perceptron blocks. This analysis suggests particular width scalings of initialization and training hyperparameters for these models. We then take up such suggestions, training Vision and Language Transformers in practical setups.
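As a rough illustration of what a "width scaling of initialization" means, the sketch below compares a single random linear map initialized with width-independent weight variance against one whose weight variance is scaled as 1/n, where n is the width. This is the generic textbook scaling for one linear layer, not the specific prescriptions derived in the paper (the abstract does not state them), and the function names `preactivation_variance` and `compare_scalings` are hypothetical.

```python
import math
import random


def preactivation_variance(width, weight_std, num_samples=200, seed=0):
    """Monte Carlo estimate of E[z^2] for z = sum_j W_j * x_j.

    Both the input x and the weights W are drawn fresh for every sample:
    x_j ~ N(0, 1) and W_j ~ N(0, weight_std^2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        w = [rng.gauss(0.0, weight_std) for _ in range(width)]
        z = sum(wj * xj for wj, xj in zip(w, x))
        total += z * z
    return total / num_samples


def compare_scalings(widths=(64, 256, 1024)):
    """Return (width, unscaled_variance, scaled_variance) for each width."""
    rows = []
    for n in widths:
        # Width-independent weight variance: signal variance grows like n.
        unscaled = preactivation_variance(n, weight_std=1.0)
        # Weight variance 1/n: signal variance stays of order one.
        scaled = preactivation_variance(n, weight_std=1.0 / math.sqrt(n))
        rows.append((n, unscaled, scaled))
    return rows


if __name__ == "__main__":
    for n, unscaled, scaled in compare_scalings():
        print(f"width={n:5d}  unscaled={unscaled:10.2f}  scaled={scaled:6.3f}")
```

With unit weight variance the estimated signal variance grows roughly in proportion to the width, while the 1/n scaling keeps it near one at every width. Keeping signals of order one as the width grows is the kind of requirement that motivates width-dependent hyperparameter choices.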
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks
