Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Etienne Boursier; Claire Boyer

arXiv:2512.11784·cs.LG·December 15, 2025

Open Access

TL;DR

This paper introduces a measure-based framework for analyzing single-layer softmax attention. It shows that, for i.i.d. Gaussian inputs, softmax attention converges to a linear operator in the infinite-prompt limit, and it provides concentration bounds that remain valid along the training trajectory.

Contribution

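It develops a unified measure-based framework for single-layer softmax attention under both finite and infinite prompts, connects softmax attention to linear attention in the large-prompt regime, and establishes non-asymptotic concentration bounds for the attention output and its gradient.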

Findings

01

For i.i.d. Gaussian inputs, softmax attention converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure.

02

Non-asymptotic concentration bounds quantify how rapidly the output and gradient of finite-prompt softmax attention approach their infinite-prompt counterparts.

03

This concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens.
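
The first finding can be checked numerically in a toy setting. The sketch below is an illustration, not the paper's construction: it assumes centered Gaussian tokens x ~ N(0, Σ), absorbs the key and query matrices into a single vector `query`, and takes values of the form V x. Under those assumptions, the standard Gaussian tilting identity E[x exp(q·x)] / E[exp(q·x)] = Σ q suggests that the attention output approaches V Σ q, which is linear in the query. The names `softmax_attention`, `gaussian_limit`, `value_matrix`, and the chosen dimensions are hypothetical.

```python
import numpy as np


def softmax_attention(query, tokens, values):
    """Single-query softmax attention over a prompt of n tokens."""
    scores = tokens @ query                   # one score per token
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights /= weights.sum()                  # normalize to a probability vector
    return weights @ values                   # weighted average of the values


def gaussian_limit(query, cov, value_matrix):
    """Assumed infinite-prompt limit for tokens ~ N(0, cov): V @ cov @ query."""
    return value_matrix @ cov @ query


rng = np.random.default_rng(0)
variances = np.array([1.0, 0.5, 2.0, 1.5])    # diagonal covariance (illustrative choice)
cov = np.diag(variances)
d = len(variances)
value_matrix = rng.normal(size=(d, d)) / np.sqrt(d)
query = np.array([0.3, -0.2, 0.1, 0.25])      # kept small so the weights stay well behaved

limit = gaussian_limit(query, cov, value_matrix)
errors = {}
for n in (100, 10_000, 200_000):
    tokens = rng.normal(size=(n, d)) * np.sqrt(variances)  # i.i.d. N(0, cov) tokens
    values = tokens @ value_matrix.T                        # value of token x is V x
    output = softmax_attention(query, tokens, values)
    errors[n] = np.linalg.norm(output - limit)
    print(f"n={n:>7d}  distance to linear limit = {errors[n]:.4f}")
```

The printed distance should shrink as the prompt length n grows. The query norm is kept small on purpose: the exponential weights become heavy-tailed for large queries, so far more tokens are needed before the finite-prompt output settles near its limit.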

Abstract

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable…

Taxonomy

Topics: Advanced Memory and Neural Computing · Adversarial Robustness in Machine Learning · Neural dynamics and brain function