Homogenized Transformers
Hugo Koubbi, Borjan Geshkovski, Philippe Rigollet

TL;DR
This paper models deep multi-head self-attention with random weights as an interacting particle system on the unit sphere and derives a homogenized limit, which it uses to study representation collapse and the trade-offs between dimension, context length, and temperature.
Contribution
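It introduces a random model of transformers in which weights are resampled independently across layers and heads, proves a homogenized limit under joint scalings of the depth, the residual step size, and the number of heads, and analyzes representation collapse and clustering regimes.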
Findings
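Depending on the scaling, the homogenized limit is either deterministic or stochastic with common noise.
In the Gaussian case, the limiting drift vanishes, which makes the homogenized dynamics explicit enough to analyze representation collapse.
Quantitative trade-offs between dimension, context length, and temperature are identified.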
Abstract
We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker–Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length,…
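To make the particle-system picture in the abstract concrete, here is a minimal NumPy sketch of a discrete-time dynamics of that general kind: tokens live on the unit sphere, each layer draws fresh Gaussian weights for every head, and a residual step is followed by renormalization. The parametrization is an assumption for illustration only and is not taken from the paper: the matrices `A` and `V`, the `1/sqrt(d)` weight scaling, averaging over heads, the inverse temperature `beta`, and the helper names `random_attention_layer`, `simulate`, and `mean_pairwise_cosine` are all hypothetical. The paper's actual joint scalings of depth, step size, and head count may differ.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def random_attention_layer(x, n_heads, step, beta, rng):
    """One residual layer with freshly sampled weights, then projection to the sphere.

    x: array of shape (n_tokens, dim) with unit-norm rows.
    All scalings below are illustrative assumptions, not the paper's.
    """
    n, d = x.shape
    update = np.zeros_like(x)
    for _ in range(n_heads):
        # Hypothetical parametrization: A stands in for the query-key product,
        # V for the value matrix; both are resampled at every layer and head.
        A = rng.standard_normal((d, d)) / np.sqrt(d)
        V = rng.standard_normal((d, d)) / np.sqrt(d)
        scores = beta * (x @ A @ x.T)      # (n, n) attention logits
        attn = softmax(scores, axis=1)     # each row sums to 1
        update += attn @ x @ V.T           # attention-weighted values
    x_new = x + step * update / n_heads    # residual step, averaged over heads
    # Renormalize each token so the dynamics stays on the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(n_tokens=8, dim=16, depth=200, n_heads=4, step=0.05, beta=1.0, seed=0):
    """Run `depth` layers from a random initial configuration on the sphere."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    for _ in range(depth):
        x = random_attention_layer(x, n_heads, step, beta, rng)
    return x


def mean_pairwise_cosine(x):
    """Average cosine similarity over distinct token pairs (1.0 means full collapse)."""
    n = x.shape[0]
    gram = x @ x.T
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))


if __name__ == "__main__":
    tokens = simulate()
    print("mean pairwise cosine:", mean_pairwise_cosine(tokens))
```

Tracking `mean_pairwise_cosine` while varying `dim`, `n_tokens`, and `beta` is one rough way to probe the kind of collapse behavior the abstract refers to, although this toy simulation makes no claim to reproduce the paper's quantitative results.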
