Uniform Scaling Limits in AdamW-Trained Transformers

William Gibson; Christoph Reisinger

arXiv:2605.11059·stat.ML·May 13, 2026

TL;DR

This paper analyzes the large-depth limit of transformers trained with AdamW, modeling hidden-state dynamics as an interacting particle system and deriving convergence results to a system of ODEs.

Contribution

The paper introduces an approach to the asymptotic behavior of AdamW-trained transformers that treats the hidden states as an interacting particle system and identifies the large-depth limit with a McKean--Vlasov ODE.
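To make the particle-system picture concrete, here is a toy sketch rather than the paper's model. It assumes scalar tokens, a single attention head without causal masking, fixed (untrained) weights shared across layers, and a residual update with step size 1/L; the names `attention_drift`, `forward`, and `max_gap` and the parameters `q`, `k`, `v` are hypothetical. Under those assumptions a depth-L network is an explicit Euler scheme for an ODE on [0, 1], so the gap to a very deep reference network should shrink as the depth grows.

```python
import math

def attention_drift(xs, q=1.0, k=1.0, v=0.5):
    """Softmax self-attention drift for scalar tokens (assumed: one head, no causal mask)."""
    drift = []
    for xi in xs:
        scores = [q * xi * k * xj for xj in xs]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        drift.append(v * sum(w * xj for w, xj in zip(weights, xs)) / total)
    return drift

def forward(xs, depth):
    """Apply `depth` residual layers with step 1/depth (explicit Euler on [0, 1])."""
    xs = list(xs)
    for _ in range(depth):
        d = attention_drift(xs)
        xs = [x + di / depth for x, di in zip(xs, d)]
    return xs

def max_gap(a, b):
    """Largest per-token difference between two hidden-state vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

tokens = [-1.0, 0.2, 0.7, 1.5]       # illustrative initial condition
reference = forward(tokens, 4000)    # very deep network as a stand-in for the ODE limit
errors = {depth: max_gap(forward(tokens, depth), reference) for depth in (10, 20, 40)}

for depth, err in errors.items():
    print(f"L={depth:3d}  gap to reference = {err:.3e}")
```

The sketch only illustrates the forward dynamics; it does not model AdamW training, the backpropagated variables, or the dependence on the number of heads that the paper analyzes.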

Findings

01 Convergence of hidden states to a system of ODEs at rate $O(L^{-1} + L^{-1/3} H^{-1/2})$

02 Uniform bounds on the difference between discrete and continuous models independent of tokens

03 Bounds become independent of token embedding dimension with suitable AdamW adaptation

Abstract

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^{2}$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $O(L^{-1} + L^{-1/3} H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are…
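As a rough reading aid for the stated rate, the two terms $L^{-1}$ and $L^{-1/3} H^{-1/2}$ are equal exactly when $H = L^{4/3}$, so the head-dependent term dominates for fewer heads and the depth term dominates for more. The sketch below evaluates both terms with all constants omitted; the function names `rate_terms` and `crossover_heads` are hypothetical, and the numbers say nothing about the actual constants in the paper's bound.

```python
import math

def rate_terms(depth: int, heads: int) -> tuple[float, float]:
    """Return (L^-1, L^-1/3 * H^-1/2), the two terms of the stated rate, constants omitted."""
    return depth ** -1.0, depth ** (-1.0 / 3.0) * heads ** -0.5

def crossover_heads(depth: int) -> float:
    """Head count at which the two terms coincide: H = L^(4/3)."""
    return depth ** (4.0 / 3.0)

for depth, heads in [(64, 16), (64, 256), (64, 1024)]:
    depth_term, head_term = rate_terms(depth, heads)
    dominant = "heads term" if head_term > depth_term else "depth term (or tie)"
    print(f"L={depth}, H={heads}: L^-1={depth_term:.4g}, "
          f"L^-1/3 H^-1/2={head_term:.4g}, dominant: {dominant}")
```

For example, with $L = 64$ the crossover sits at $H = 64^{4/3} = 256$ heads.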
