Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations
Michel Fabrice Serret, Alice Cortinovis, Yijun Dong, Diana Halikias, Anna Ma, Fabio Matti, Deanna Needell, Katherine J. Pearce, Elizaveta Rebrova, Disha Shur, Rudi Smith, Hai-Xiao Wang, Laura Grigori

TL;DR
This survey reviews approximation and reformulation techniques for attention mechanisms in Transformers, emphasizing tools from numerical linear algebra that improve scalability and efficiency.
Contribution
It systematically classifies and unifies diverse fast attention methods within a numerical analysis framework, highlighting interdisciplinary opportunities.
Findings
Classifies attention approximation methods based on numerical principles
Discusses kernel-inspired reformulations and architectural variants
Highlights potential for further mathematical contributions to scalable attention
Abstract
The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work aimed at accelerating attention through approximation and reformulation. In this survey, we revisit attention mechanisms through the lens of numerical analysis, with a particular emphasis on tools and perspectives from numerical linear algebra. Our goal is twofold: first, we aim to systematically review and classify fast approximation methods according to the numerical principles they exploit. These include sparsity and clustering approaches, low-rank and subspace projection techniques, randomized sketching methods, and tensor-based decompositions. We also discuss kernel-inspired reformulations of attention and recent architectural variants, such as…
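To make the quadratic cost and the idea of a kernel-inspired reformulation concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the survey's own method: the function names are hypothetical, and the feature map elu(x) + 1 is one common choice from the linear-attention literature, picked here only for the example. The first function forms the full n x n score matrix. The last two compute the same kernelized attention in two multiplication orders, one quadratic and one linear in the sequence length n.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention; forms the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (n, n): quadratic in n
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilise the exponential
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feature_map(X):
    """Positive feature map elu(x) + 1 (an assumption chosen for illustration)."""
    return np.where(X > 0, X + 1.0, np.exp(np.minimum(X, 0.0)))

def kernel_attention_quadratic(Q, K, V):
    """Kernelized attention computed the naive way, via an n x n matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    A = phi_q @ phi_k.T                                    # (n, n)
    return (A @ V) / A.sum(axis=-1, keepdims=True)

def kernel_attention_linear(Q, K, V):
    """Same quantity, reordered by associativity so no n x n matrix is formed."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    KV = phi_k.T @ V                                       # (d, d_v): cost O(n * d * d_v)
    z = phi_k.sum(axis=0)                                  # (d,): normaliser
    return (phi_q @ KV) / (phi_q @ z)[:, None]
```

The two kernelized variants agree up to floating-point rounding, because (φ(Q) φ(K)ᵀ) V = φ(Q) (φ(K)ᵀ V). Neither reproduces softmax attention exactly: the kernel replaces the exponential similarity, so this is a reformulation and not an approximation with an error guarantee.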
