Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations

Michel Fabrice Serret; Alice Cortinovis; Yijun Dong; Diana Halikias; Anna Ma; Fabio Matti; Deanna Needell; Katherine J. Pearce; Elizaveta Rebrova; Disha Shur; Rudi Smith; Hai-Xiao Wang; Laura Grigori

arXiv:2604.01757 · math.NA · April 3, 2026

TL;DR

This survey reviews approximation and reformulation techniques for attention mechanisms in Transformers, with an emphasis on numerical linear algebra tools for improving scalability and efficiency.

Contribution

The survey systematically classifies and unifies fast attention methods within a numerical analysis framework and highlights opportunities for interdisciplinary work.

Findings

01 Classifies attention approximation methods based on numerical principles

02 Discusses kernel-inspired reformulations and architectural variants

03 Highlights potential for further mathematical contributions to scalable attention

Abstract

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work aimed at accelerating attention through approximation and reformulation. In this survey, we revisit attention mechanisms through the lens of numerical analysis, with a particular emphasis on tools and perspectives from numerical linear algebra. Our goal is twofold: first, we aim to systematically review and classify fast approximation methods according to the numerical principles they exploit. These include sparsity and clustering approaches, low-rank and subspace projection techniques, randomized sketching methods, and tensor-based decompositions. We also discuss kernel-inspired reformulations of attention and recent architectural variants, such as…
