Uniform Scaling Limits in AdamW-Trained Transformers
William Gibson, Christoph Reisinger

TL;DR
This paper analyzes the large-depth limit of transformers trained with AdamW, modeling the hidden-state dynamics as an interacting particle system and proving convergence to a system of ODEs.
Contribution
It introduces a novel approach to study the asymptotic behavior of AdamW-trained transformers using interacting particle systems and McKean--Vlasov ODEs.
Findings
Convergence of the hidden states and backpropagated variables to a forward--backward system of ODEs at rate O(L^{-1}+L^{-1/3}H^{-1/2}), where L is the depth and H the number of heads
Uniform bounds on the difference between discrete and continuous models independent of tokens
Bounds become independent of token embedding dimension with suitable AdamW adaptation
Abstract
We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in , uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate O(L^{-1}+L^{-1/3}H^{-1/2}). Here, L and H denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are…
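To make the interacting-particle picture concrete, here is a minimal sketch in which each token is a particle and a residual attention layer with step size 1/L acts as one explicit Euler step of an ODE. Everything in it is an illustrative assumption rather than the paper's construction: a single head, identity query/key/value maps shared across layers, no causal mask, no training, and the hypothetical names `attention_drift`, `forward`, and `max_gap`. It only probes the depth-discretization term of order L^{-1}; it says nothing about AdamW, the number of heads, or the backpropagated variables.

```python
import math

def attention_drift(xs, beta=1.0):
    """Velocity of each token: a softmax-weighted average of all tokens.

    Assumption for illustration: one head, identity Q/K/V maps, no causal mask.
    """
    dim = len(xs[0])
    drift = []
    for xi in xs:
        scores = [beta * sum(a * b for a, b in zip(xi, xj)) for xj in xs]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        drift.append([
            sum(w * xj[k] for w, xj in zip(weights, xs)) / total
            for k in range(dim)
        ])
    return drift

def forward(xs, depth, horizon=1.0):
    """Apply `depth` residual layers x <- x + (horizon / depth) * attention(x)."""
    step = horizon / depth
    xs = [list(x) for x in xs]
    for _ in range(depth):
        velocity = attention_drift(xs)
        xs = [[a + step * v for a, v in zip(x, vel)]
              for x, vel in zip(xs, velocity)]
    return xs

def max_gap(a, b):
    """Largest coordinate-wise difference between two token configurations."""
    return max(abs(p - q) for x, y in zip(a, b) for p, q in zip(x, y))

tokens = [[1.0, 0.0], [0.0, 1.0], [-0.5, 0.5]]
reference = forward(tokens, 4096)  # very deep network as a proxy for the ODE limit
gaps = {L: max_gap(forward(tokens, L), reference) for L in (16, 32, 64, 128)}
for L, gap in gaps.items():
    print(L, gap)
```

In this toy setting the gap to the deep reference should shrink roughly in proportion to 1/L as the depth doubles, which is the behaviour expected of a first-order Euler scheme. The L^{-1/3}H^{-1/2} term in the stated rate involves the number of heads and is not captured here.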
