Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width

Jose Marie Antonio Miñoza; Paulo Mario P. Medina; Sebastian C. Ibañez

arXiv:2603.13085·cs.LG·May 8, 2026


TL;DR

This paper shows that linearized attention does not reach the kernel (NTK) regime at any practical width, because the attention transformation cubes the condition number of the input Gram matrix, and that it exhibits higher influence malleability than ReLU networks.

Contribution

The paper establishes that linearized attention does not converge to its NTK limit at feasible widths, and it introduces influence malleability as a key characteristic of attention's learning dynamics.

Findings

01

Linearized attention requires infeasibly large widths for NTK convergence.

02

The attention transformation cubes the condition number of the input Gram matrix.

03

Linearized attention shows higher influence malleability than ReLU networks under adversarial perturbations.

Abstract

Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa_d(G)^6 \, n \log n)$ for NTK convergence, where $\kappa_d(G)$ is the effective condition number…
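The cubing effect described in the abstract can be checked numerically with a minimal sketch. Two things here are assumptions rather than the paper's stated definitions: parameter-free linearized attention is taken to be the map $X \mapsto (X X^\top) X$, and the effective condition number $\kappa_d(G)$ is taken to be the ratio of the largest to the $d$-th largest eigenvalue of the Gram matrix $G = X X^\top$. Under those assumptions the Gram matrix of the attention outputs is $G^3$, so its effective condition number is $\kappa_d(G)^3$. The helper name `effective_condition_number` and the sizes `n`, `d` are illustrative only.

```python
import numpy as np


def effective_condition_number(gram, d):
    """Largest over d-th largest eigenvalue (an assumed definition of kappa_d)."""
    eigvals = np.linalg.eigvalsh(gram)[::-1]  # eigvalsh is ascending; reverse it
    return eigvals[0] / eigvals[d - 1]


rng = np.random.default_rng(0)
n, d = 64, 8                      # illustrative sizes, not taken from the paper
X = rng.standard_normal((n, d))   # n tokens with d features each

G = X @ X.T                       # input Gram matrix (rank d)
Z = G @ X                         # assumed parameter-free linearized attention
G_attn = Z @ Z.T                  # Gram matrix of the outputs, equal to G @ G @ G

kappa_in = effective_condition_number(G, d)
kappa_out = effective_condition_number(G_attn, d)

# Scale of the width bound kappa^6 * n * log(n), with constants ignored.
width_scale = kappa_in**6 * n * np.log(n)

print(f"kappa_d(G)           = {kappa_in:.3f}")
print(f"kappa_d(G)^3         = {kappa_in**3:.3f}")
print(f"kappa_d(output Gram) = {kappa_out:.3f}")
print(f"width scale          = {width_scale:.3e}")
```

Because the width requirement grows with the sixth power of $\kappa_d(G)$, even a moderately ill-conditioned Gram matrix pushes the scale far past typical layer widths: for example, $\kappa_d(G) = 10$ and $n = 1000$ give $10^6 \cdot 1000 \cdot \ln 1000 \approx 6.9 \times 10^9$ before constants.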
