$k$NN Attention Demystified: A Theoretical Exploration for Scalable Transformers
Themistoklis Haris

TL;DR
This paper provides a theoretical analysis of $k$NN attention in Transformers, introduces efficient approximation algorithms, and demonstrates their practical benefits in training and inference for scalable models.
Contribution
It establishes a theoretical framework for $k$NN attention, proposes novel sub-quadratic algorithms, and empirically validates their effectiveness.
Findings
Theoretical guarantees for $k$NN attention approximation
Development of sub-quadratic gradient approximation algorithms
Empirical improvements in training and inference efficiency
Abstract
Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like $k$-Nearest-Neighbor ($k$NN) attention have been introduced [Roy, Saffar, Vaswani, Grangier, 2021], enabling each token to attend to only its $k$ closest tokens. While $k$NN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for $k$NN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling [Mussmann, Levy, Ermon, 2017] with $k$NN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling…
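To make the abstract's central idea concrete, the following is a minimal NumPy sketch, not the paper's algorithm. It relies on the standard fact that row i of softmax attention is the expectation of V[j] when j is drawn from softmax(Q[i] @ K.T), and it contrasts exact attention with a brute-force version in which each query attends only to its k highest-scoring keys. The function names, the use of raw inner products without the usual 1/sqrt(d) scaling, and the brute-force top-k search are assumptions made for illustration. The sketch still computes all n x n scores, so it is not sub-quadratic, and it does not implement the kNN index or the lazy Gumbel sampling mentioned in the abstract.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Exact attention: row i equals E[V[j]] with j ~ softmax(Q[i] @ K.T)."""
    scores = Q @ K.T                                  # (n, n) inner products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V


def knn_attention(Q, K, V, k):
    """Illustrative top-k attention: each query attends to its k best keys only.

    Assumption: "closest" means largest inner product, found by brute force.
    """
    scores = Q @ K.T                                     # (n, n)
    # Indices of the k largest scores in each row (order within the k is arbitrary).
    top = np.argpartition(scores, -k, axis=1)[:, -k:]    # (n, k)
    top_scores = np.take_along_axis(scores, top, axis=1)
    top_scores = top_scores - top_scores.max(axis=1, keepdims=True)
    weights = np.exp(top_scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over the k keys
    # V[top] has shape (n, k, d_v); take the weighted average over the k keys.
    return np.einsum("nk,nkd->nd", weights, V[top])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    exact = softmax_attention(Q, K, V)
    for k in (1, 4, n):
        err = np.abs(knn_attention(Q, K, V, k) - exact).max()
        print(f"k={k}: max abs deviation from exact attention = {err:.4f}")
```

With k equal to the sequence length the two functions agree, and with k = 1 each output row is simply the value of the single best-matching key; intermediate k trades accuracy for fewer attended tokens.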
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Advanced Data Storage Technologies
Methods: Attention Is All You Need · Softmax
