Training Infinitely Deep and Wide Transformers
Raphaël Barboni, Maarten V. de Hoop, Takashi Furuya, Gabriel Peyré

TL;DR
This paper provides a rigorous mathematical framework for understanding the training dynamics of infinitely deep and wide transformers, modeling them as neural PDEs and analyzing their gradient flows in the mean-field regime.
Contribution
It introduces a novel mean-field model for transformers, establishes well-posedness of their infinite-depth training, and characterizes the conditions for global convergence of gradient descent.
Findings
Well-posedness of infinitely deep transformer forward pass.
Explicit gradient formulas involving adjoint variables.
NTK injectivity linked to token distribution properties.
Abstract
Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy…
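To make the two measure representations in the abstract concrete, here is a minimal finite-size sketch, not the paper's actual model. It treats tokens as particles updated by residual attention layers: the update is averaged over M heads (a finite stand-in for a measure over attention parameters) and scaled by 1/L (an explicit Euler step in depth). The function names, the 1/M and 1/L scalings, and the plain softmax parameterization with matrices Q, K, V are assumptions made for illustration.

```python
import numpy as np

def attention_velocity(X, Q, K, V):
    """Head-averaged softmax attention acting on token particles.

    X: (n, d) array of n tokens in dimension d.
    Q, K, V: (M, d, d) arrays, one query/key/value matrix per head.
    Hypothetical parameterization, chosen only for illustration.
    """
    M = Q.shape[0]
    out = np.zeros_like(X)
    for h in range(M):
        # Pairwise scores <Q_h x_i, K_h x_j>, shape (n, n).
        scores = (X @ Q[h].T) @ (X @ K[h].T).T
        # Subtract the row-wise max for numerical stability of the softmax.
        scores = scores - scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        # Each token moves toward a weighted average of value vectors V_h x_j.
        out += weights @ (X @ V[h].T)
    # Averaging over heads approximates integrating against a parameter measure.
    return out / M

def forward(X, params):
    """Residual forward pass: one explicit Euler step of size 1/L per layer.

    params: list of L tuples (Q, K, V), one per layer.
    """
    L = len(params)
    for Q, K, V in params:
        X = X + attention_velocity(X, Q, K, V) / L
    return X
```

In this sketch, increasing `len(params)` refines the depth discretization and increasing the leading dimension of `Q`, `K`, `V` adds heads, which loosely mirrors the infinite-depth and infinite-width limits described above. Because every token's update depends on all other tokens through the attention weights, the whole token distribution evolves together rather than each token following its own trajectory.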
