Infinite Limits of Multi-head Transformer Dynamics
Blake Bordelon, Hamza Tahir Chaudhry, Cengiz Pehlevan

TL;DR
This paper investigates the training dynamics of transformer models in the feature learning regime, analyzing various infinite-width and depth limits using dynamical mean field theory to understand how parameterization affects learned features.
Contribution
The paper identifies parameterizations that admit well-defined infinite-width and infinite-depth limits, and analyzes several infinite regimes of transformers, providing a theoretical framework for understanding their training dynamics.
Findings
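Different infinite limits (infinite key/query dimension, infinite heads, and infinite depth) have distinct statistical descriptions, which also depend on how the attention layers are scaled.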
Parameterization influences the features learned by transformers.
Numerical evidence supports convergence to the theoretical limits.
Abstract
In this work, we analyze various scaling limits of the training dynamics of transformer models in the feature learning regime. We identify the set of parameterizations that admit well-defined infinite width and depth limits, allowing the attention layers to update throughout training--a relevant notion of feature learning in these models. We then use tools from dynamical mean field theory (DMFT) to analyze various infinite limits (infinite key/query dimension, infinite heads, and infinite depth) which have different statistical descriptions depending on which infinite limit is taken and how attention layers are scaled. We provide numerical evidence of convergence to the limits and discuss how the parameterization qualitatively influences learned features.
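To make concrete why the scaling of attention layers matters as the key/query dimension grows, here is a minimal sketch that is not taken from the paper. It assumes keys and queries with i.i.d. standard normal entries, which is only a stand-in for a random initialization, and measures the variance of the attention logit (k · q) / N^alpha. The function name `scaled_logit_variance` and the exponent `alpha` are hypothetical and introduced only for illustration. Under this assumption, alpha = 1/2 (the standard transformer scaling) keeps the logit variance near 1 for any N, while alpha = 1 makes it shrink like 1/N.

```python
import random
import statistics


def scaled_logit_variance(n, alpha, num_samples=1000, seed=0):
    """Empirical variance of (k . q) / n**alpha.

    Assumption (not from the source): k and q are independent vectors in R^n
    with i.i.d. N(0, 1) entries, mimicking a random initialization.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    logits = []
    for _ in range(num_samples):
        # Dot product of one random key with one random query.
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(n))
        logits.append(dot / n ** alpha)
    return statistics.pvariance(logits)


if __name__ == "__main__":
    for n in (16, 64, 256):
        v_half = scaled_logit_variance(n, alpha=0.5)
        v_one = scaled_logit_variance(n, alpha=1.0)
        print(f"N={n:4d}  var[alpha=1/2]={v_half:.4f}  var[alpha=1]={v_one:.6f}")
```

Running the script prints the two variances for a few values of N. The exact numbers depend on the seed, but the alpha = 1/2 column should stay of order one while the alpha = 1 column decays roughly as 1/N. This only describes behavior at a random initialization under the stated assumption. It says nothing about how keys and queries correlate after training, which is the regime the abstract is concerned with.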
Taxonomy
Topics: Control and Stability of Dynamical Systems · Control Systems in Engineering · Physics and Engineering Research Articles
Methods: Sparse Evolutionary Training
