Homogenized Transformers

Hugo Koubbi; Borjan Geshkovski; Philippe Rigollet

arXiv:2604.01978·math.PR·April 3, 2026


TL;DR

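This paper models deep multi-head self-attention with randomly resampled weights as an interacting particle system on the unit sphere and derives a homogenized limit, which it uses to study representation collapse and the associated trade-offs between dimension, context length, and temperature.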

Contribution

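The paper introduces a random model of transformers in which weights are resampled independently across layers and heads, proves a homogenized limit under joint scalings of the depth, the residual step size, and the number of heads, and analyzes representation collapse and clustering regimes.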

Findings

01

Depending on the scaling, the homogenized limit is either deterministic or stochastic with common noise.

02

In the Gaussian setting, the limiting drift vanishes, which makes the homogenized dynamics explicit enough to study representation collapse.

03

The analysis identifies quantitative trade-offs between dimension, context length, and temperature.

Abstract

We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker–Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length,…
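
To make the particle-system picture in the abstract concrete, the following is a minimal NumPy sketch of a discrete-time dynamics of that flavor. It is not the authors' model: the Gaussian weights with variance 1/d, the averaging of the update over heads, the renormalization step that keeps tokens on the sphere, and all names (`random_attention_layer`, `simulate`, `mean_cosine`, `step`, `beta`) are assumptions chosen for illustration. The paper's actual scalings and update rule may differ.

```python
import numpy as np


def random_attention_layer(x, num_heads, step, beta, rng):
    """One residual self-attention step with freshly sampled weights.

    x has shape (n, d): n tokens, each on the unit sphere in R^d.
    Assumed (hypothetical) choices: Gaussian Q, K, V with variance 1/d,
    head-averaged update, and projection back to the sphere by normalizing.
    """
    n, d = x.shape
    update = np.zeros_like(x)
    for _ in range(num_heads):
        # Fresh weights for every head of every layer.
        q = rng.standard_normal((d, d)) / np.sqrt(d)
        k = rng.standard_normal((d, d)) / np.sqrt(d)
        v = rng.standard_normal((d, d)) / np.sqrt(d)
        # logits[i, j] = beta * <Q x_i, K x_j>; beta acts as inverse temperature.
        logits = beta * (x @ q.T) @ (x @ k.T).T
        logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        update += weights @ (x @ v.T)
    x_new = x + step * update / num_heads
    # Keep every token on the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(n=8, d=16, depth=50, num_heads=4, step=0.1, beta=1.0, seed=0):
    """Run `depth` layers from uniformly random initial tokens on the sphere."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    for _ in range(depth):
        x = random_attention_layer(x, num_heads, step, beta, rng)
    return x


def mean_cosine(x):
    """Average pairwise cosine similarity between distinct tokens."""
    n = x.shape[0]
    gram = x @ x.T
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))
```

Calling `mean_cosine(simulate(...))` for different values of `d`, `n`, and `beta` gives a rough proxy for how clustered the tokens are after a given depth: values near 1 mean the tokens have nearly collapsed to a single point. This is only an empirical probe under the assumptions above, not a check of the paper's theorems.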
