Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
Etienne Boursier, Claire Boyer

TL;DR
This paper introduces a measure-based framework for analyzing softmax attention in transformers, showing that it converges to a linear operator in the large-prompt regime and providing tools for studying training dynamics.
Contribution
It develops a unified, measure-based approach to single-layer softmax attention, connects it to linear attention in the large-prompt regime, and provides non-asymptotic concentration bounds on its output and gradient.
Findings
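For i.i.d. Gaussian inputs, softmax attention converges in the infinite-prompt limit to a linear operator acting on the input-token measure.
Non-asymptotic concentration bounds quantify how quickly the output and gradient of finite-prompt softmax attention approach their infinite-prompt counterparts.
This concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens.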
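To see why a linear limit is plausible, here is a short worked identity under a simplified parameterization that is assumed for illustration and is not taken from the paper: tokens $x_i \sim \mathcal{N}(0, I_d)$ i.i.d., a fixed query $q$, a key-query matrix $W$, and values equal to the tokens themselves. Writing $a = W^\top q$, the softmax-weighted average is a ratio of empirical means, and the law of large numbers together with the Gaussian moment generating function gives a limit that is linear in $q$.

```latex
\begin{align*}
\mathrm{Att}_n(q)
  &= \frac{\sum_{i=1}^{n} x_i \, e^{\langle a, x_i \rangle}}
          {\sum_{i=1}^{n} e^{\langle a, x_i \rangle}},
  \qquad a = W^\top q, \\
\mathbb{E}\big[e^{\langle a, x \rangle}\big]
  &= e^{\|a\|^2 / 2},
  \qquad
\mathbb{E}\big[x \, e^{\langle a, x \rangle}\big]
  = \nabla_a \, e^{\|a\|^2 / 2}
  = a \, e^{\|a\|^2 / 2}, \\
\mathrm{Att}_n(q)
  &\xrightarrow[n \to \infty]{\text{a.s.}}
  \frac{\mathbb{E}\big[x \, e^{\langle a, x \rangle}\big]}
       {\mathbb{E}\big[e^{\langle a, x \rangle}\big]}
  = a = W^\top q .
\end{align*}
```

This toy case only covers the limit itself; the rates and the gradient bounds are the paper's contribution and are not reproduced here.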
Abstract
Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable…
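The large-prompt behavior described in the abstract can be checked numerically in a toy setting. The sketch below is an illustration under assumptions that are not taken from the paper: standard Gaussian tokens, values equal to the tokens, and a hypothetical key-query matrix `W` chosen as a scaled orthogonal matrix. In that setting the Gaussian moment generating function gives the limit `W.T @ query`, and the code measures how far a finite prompt is from it.

```python
import numpy as np


def softmax_attention(query, tokens, W):
    """Softmax-weighted average of the tokens for a single query.

    Scores are query^T W x_i; values are the tokens themselves
    (an assumption made for this illustration).
    """
    scores = tokens @ (W.T @ query)      # shape (n,)
    scores = scores - scores.max()       # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ tokens              # shape (d,)


def infinite_prompt_limit(query, W):
    """Limit for standard Gaussian tokens: E[x e^{<a,x>}] / E[e^{<a,x>}] = a,
    with a = W^T query. Linear in the query."""
    return W.T @ query


rng = np.random.default_rng(0)
d = 4

# Hypothetical parameters: W is 0.5 times a random orthogonal matrix and the
# query has unit norm, so that ||W^T query|| = 0.5 and the weights stay tame.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W = 0.5 * Q
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

limit = infinite_prompt_limit(query, W)

# Distance between finite-prompt attention and its limit, per prompt length.
errors = {}
for n in (100, 2_000, 200_000):
    tokens = rng.standard_normal((n, d))
    errors[n] = np.linalg.norm(softmax_attention(query, tokens, W) - limit)

for n, err in errors.items():
    print(f"n = {n:>7d}   ||Att_n(q) - W^T q|| = {err:.4f}")
```

The printed distances should shrink roughly like the inverse square root of the prompt length, which is the usual Monte Carlo rate; the paper's non-asymptotic bounds are the rigorous counterpart and may differ in form.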
Taxonomy
Topics: Advanced Memory and Neural Computing · Adversarial Robustness in Machine Learning · Neural dynamics and brain function
