Scaling Limits of Long-Context Transformers
Giuseppe Bruno, Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

TL;DR
This paper analyzes softmax self-attention in the long-context limit, showing how the scaling of the inverse temperature determines whether attention averages over the context or concentrates on the single nearest key.
Contribution
The paper characterizes the phase transition in attention behavior across scaling regimes of the inverse temperature and gives explicit limiting laws for the ordered attention weights and the attention output.
Findings
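The critical scaling for selectivity is determined by the local behavior of the distance-to-query distribution near zero, not by global features of the context.
In the subcritical regime, the attention output is a local average around the query, with deterministic bias and Gaussian fluctuations.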
In the supercritical regime, attention concentrates on the nearest key.
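The two extremes in the findings can be seen in a small numerical sketch. This is an illustration only, not the paper's experiment: the sample size `n`, dimension `d`, random seed, the inverse-temperature values, and the helper names `sample_sphere` and `attention_weights` are all assumptions chosen here.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n i.i.d. points uniformly on the unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def attention_weights(query, keys, beta):
    """Softmax attention weights with inverse temperature beta."""
    logits = beta * (keys @ query)
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


rng = np.random.default_rng(0)   # assumed seed, for reproducibility only
n, d = 2000, 3                   # assumed context length and dimension
keys = sample_sphere(n, d, rng)
query = np.zeros(d)
query[0] = 1.0                   # fixed query on the sphere

nearest = int(np.argmax(keys @ query))  # closest key = largest inner product

w_flat = attention_weights(query, keys, beta=0.0)   # no selectivity
w_sharp = attention_weights(query, keys, beta=1e8)  # very strong selectivity

print("largest weight at beta=0:   ", w_flat.max())
print("largest weight at beta=1e8: ", w_sharp.max())
print("sharp attention picks nearest key:", int(np.argmax(w_sharp)) == nearest)
```

With zero inverse temperature every key gets weight `1/n`. With a very large inverse temperature nearly all the mass sits on the nearest key. The regimes described in the findings lie between these two extremes, as the inverse temperature grows with the context length.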
Abstract
We study the long-context limit of softmax self-attention with a fixed query and a random context of i.i.d. keys on the sphere, viewing the inverse temperature as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and we identify how it scales for uniform keys on the sphere. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of the inverse temperature: a subcritical regime in which the output reduces to a local average around the query with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a…
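For orientation, the objects in the abstract can be written out as follows. The notation is an assumption made here, not necessarily the paper's: $q$ is the query, $k_1,\dots,k_n$ are the keys, $\beta$ is the inverse temperature, and the values are taken equal to the keys purely for illustration.

```latex
% Softmax attention weights and output (notation assumed for illustration)
w_i(\beta) = \frac{\exp\bigl(\beta \langle q, k_i \rangle\bigr)}
                  {\sum_{j=1}^{n} \exp\bigl(\beta \langle q, k_j \rangle\bigr)},
\qquad
\mathrm{Attn}_\beta(q) = \sum_{i=1}^{n} w_i(\beta)\, k_i .

% On the unit sphere the inner product is a function of distance alone:
\langle q, k_i \rangle = 1 - \tfrac{1}{2}\,\lVert q - k_i \rVert^{2},
\qquad\text{so}\qquad
w_i(\beta) \propto \exp\!\Bigl(-\tfrac{\beta}{2}\,\lVert q - k_i \rVert^{2}\Bigr).
```

The second line shows why the distance-to-query distribution is the relevant quantity: for unit-norm query and keys, the weights depend on the keys only through their distances to the query. At $\beta = 0$ all weights equal $1/n$, and as $\beta \to \infty$ with $n$ fixed the weight concentrates on the closest key.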
