Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width
Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez

TL;DR
This paper demonstrates that linearized attention mechanisms do not reach the kernel regime at practical widths, revealing fundamental limitations in their learning dynamics and influence malleability.
Contribution
It establishes the non-convergence of linearized attention to its NTK limit at feasible widths and introduces influence malleability as a key characteristic.
Findings
Linearized attention requires infeasibly large widths for NTK convergence.
Attention transformation significantly amplifies the condition number of the input Gram matrix.
Linearized attention shows higher influence malleability than ReLU networks under adversarial perturbations.
Abstract
Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, so the width required for NTK convergence grows with the effective condition number…
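The cubing of the condition number can be checked numerically with a small sketch. It assumes that parameter-free linearized attention maps an input matrix X to (X Xᵀ) X, with no softmax and no learned weights; the abstract does not spell out this form, so treat it as an illustrative assumption. Under that assumption the Gram matrix of the output is (X Xᵀ)³, whose condition number is the cube of that of X Xᵀ. The function names and the sizes `n` and `d` are hypothetical.

```python
import numpy as np


def linearized_attention(X):
    """Assumed parameter-free linearized attention: scores X X^T applied to values X."""
    return (X @ X.T) @ X


def gram_condition_number(Z):
    """Condition number of the Gram matrix Z Z^T, from the singular values of Z.

    Assumes Z has full row rank, so the smallest singular value is nonzero.
    """
    s = np.linalg.svd(Z, compute_uv=False)  # sorted in descending order
    return (s[0] / s[-1]) ** 2


rng = np.random.default_rng(0)
n, d = 8, 32  # hypothetical sizes: n tokens, d features, n < d so X X^T is full rank
X = rng.standard_normal((n, d)) / np.sqrt(d)

kappa_in = gram_condition_number(X)
kappa_out = gram_condition_number(linearized_attention(X))

print(f"input Gram condition number:  {kappa_in:.4g}")
print(f"output Gram condition number: {kappa_out:.4g}")
print(f"input condition number cubed: {kappa_in ** 3:.4g}")
```

The last two printed values should agree up to floating-point error, because the singular values of (X Xᵀ) X are the cubes of the singular values of X.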
