Scaling Limits of Long-Context Transformers

Giuseppe Bruno; Shi Chen; Zhengjiang Lin; Yury Polyanskiy; Philippe Rigollet

arXiv:2605.08505 · cs.LG · May 12, 2026

TL;DR

This paper analyzes softmax self-attention in the long-context limit, with a fixed query and $n$ i.i.d. random keys on the sphere. It shows how the scaling of the inverse temperature $\beta_n$ decides whether attention degenerates into uniform averaging or collapses onto the single closest key.

Contribution

The paper identifies the critical scale of $\beta_n$ at which selectivity emerges and gives explicit limiting laws for the ordered attention weights and the attention output across all regimes of $\beta_n$.

Findings

01

The critical scale for selectivity is determined by the local exponent of the distance-to-query distribution near zero, not by global features of the context.

02

In the subcritical regime, the attention output reduces to a local average around the query $q$, with an explicit deterministic bias and Gaussian fluctuations.

03

In the supercritical regime, attention concentrates on the nearest key.
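
A heuristic for why the exponent $2/(d-1)$ appears for uniform keys can be sketched as follows. This is a back-of-the-envelope argument, not the paper's proof. It assumes the attention score is $\beta_n \langle q, k_i \rangle$ with unit-norm $q$ and $k_i$; the symbols $w_i$ and $r_n$ are introduced here for illustration only.

```latex
% Assumption: scores are \beta_n <q, k_i> with |q| = |k_i| = 1.
\begin{align*}
\langle q, k_i \rangle &= 1 - \tfrac{1}{2}\,\|q - k_i\|^2
  && \text{(unit vectors)} \\
w_i &\propto \exp\!\big(-\tfrac{\beta_n}{2}\,\|q - k_i\|^2\big)
  && \text{(softmax weight of key } i\text{)} \\
\mathbb{P}\big(\|q - k_i\| \le r\big) &\asymp r^{\,d-1}
  && \text{(uniform keys on } S^{d-1},\ r \to 0\text{)} \\
r_n &\asymp n^{-1/(d-1)}
  && \text{(distance to the nearest of } n \text{ keys)} \\
\beta_n^{*}\, r_n^{2} \asymp 1
  \;&\Longrightarrow\; \beta_n^{*} \asymp n^{2/(d-1)}
  && \text{(nearest keys become distinguishable)}
\end{align*}
```

The exponent $d-1$ in the third line is the local exponent of the distance-to-query distribution near zero, which is why only local behavior enters the critical scale in this heuristic.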

Abstract

We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $\beta_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $\beta_n^{*} \asymp n^{2/(d-1)}$ for uniform keys on $S^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $\beta_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a…
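
The two extreme regimes described in the abstract can be explored numerically with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the attention score is the inner product scaled by the inverse temperature, and the function names (`sample_sphere`, `attention_weights`, `top_weight`) and the multiplier `c` are hypothetical choices made here for illustration.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n i.i.d. points uniformly on the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def attention_weights(q, keys, beta):
    """Softmax weights proportional to exp(beta * <q, k_i>) (assumed score)."""
    scores = beta * (keys @ q)
    scores = scores - scores.max()  # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()


def top_weight(n, d, c, seed=0):
    """Largest attention weight when beta = c * n**(2 / (d - 1))."""
    rng = np.random.default_rng(seed)
    q = np.zeros(d)
    q[0] = 1.0  # fixed query; any unit vector works by symmetry
    keys = sample_sphere(n, d, rng)
    beta = c * n ** (2.0 / (d - 1))  # c rescales the critical order n^{2/(d-1)}
    return attention_weights(q, keys, beta).max()
```

Calling `top_weight(2000, 3, c)` with `c` far below 1 gives a largest weight that stays small, since the mass is spread over many keys, while `c` far above 1 pushes the largest weight toward 1, consistent with collapse onto the nearest key. Values of `c` near 1 probe the critical regime, where the outcome varies noticeably with the random seed.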
