Effective Theory of Transformers at Initialization

Emily Dinan; Sho Yaida; Susan Zhang

arXiv:2304.02034·cs.LG·April 6, 2023·1 cites

Open Access

TL;DR

This paper analyzes forward and backward signal propagation in wide, deep Transformers at initialization, derives suggested width scalings for initialization and training hyperparameters, and applies those suggestions when training Vision and Language Transformers in practical setups.

Contribution

The paper offers an effective-theory framework for understanding signal propagation in Transformers at initialization and proposes specific width scalings of initialization and training hyperparameters, which it then applies in practical training runs.

Findings

01 Guidance on width scalings of initialization and training hyperparameters

02 Improved training stability and performance in Vision and Language Transformers

03 Theoretical insights align with empirical results

Abstract

We perform an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, i.e., residual neural networks with multi-head self-attention blocks and multilayer perceptron blocks. This analysis suggests particular width scalings of initialization and training hyperparameters for these models. We then take up such suggestions, training Vision and Language Transformers in practical setups.
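To make the idea of a width scaling concrete, here is a minimal sketch in Python with NumPy. It does not reproduce the scalings derived in the paper. It only illustrates the standard observation that a dense layer's initialization variance must shrink like 1/width for the forward signal to stay order one as the width grows. The function name `preactivation_scale` and its parameters are hypothetical, chosen for this illustration.

```python
import numpy as np

def preactivation_scale(width, scale_with_width, seed=0):
    """Mean squared preactivation of one random dense layer at initialization.

    Illustrative assumption: a single square linear layer with Gaussian
    weights, not the Transformer blocks analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)  # input whose components are O(1)
    # Either variance 1/width (width-aware) or variance 1 (width-agnostic).
    std = 1.0 / np.sqrt(width) if scale_with_width else 1.0
    W = rng.standard_normal((width, width)) * std
    z = W @ x
    return float(np.mean(z ** 2))

for n in (256, 1024):
    scaled = preactivation_scale(n, scale_with_width=True)
    unscaled = preactivation_scale(n, scale_with_width=False)
    print(f"width={n}: scaled={scaled:.2f}, unscaled={unscaled:.1f}")
```

With the 1/width variance the printed signal size stays close to one at both widths, while the unscaled version grows roughly in proportion to the width. An effective-theory analysis of the kind the abstract describes extends this sort of bookkeeping to the backward pass and to attention and MLP blocks with residual connections.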

Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks