$k$NN Attention Demystified: A Theoretical Exploration for Scalable Transformers

Themistoklis Haris

arXiv:2411.04013·cs.LG·November 11, 2024

TL;DR

This paper provides a theoretical analysis of $k$NN attention in Transformers, introduces efficient approximation algorithms, and demonstrates their practical benefits in training and inference for scalable models.

Contribution

It establishes a theoretical framework for $k$NN attention, proposes novel sub-quadratic algorithms, and empirically validates their effectiveness.

Findings

01 Theoretical guarantees for $k$NN attention approximation

02 Development of sub-quadratic gradient approximation algorithms

03 Empirical improvements in training and inference efficiency

Abstract

Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like $k$-Nearest-Neighbor ($k$NN) attention have been introduced [Roy, Saffar, Vaswani, Grangier, 2021] enabling each token to attend to only its $k$ closest tokens. While $k$NN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for $k$NN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling [Mussmann, Levy, Ermon, 2017] with $k$NN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling…
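As a rough illustration of the ideas the abstract names, the plain-Python sketch below computes the attention output for a single query three ways: exactly, restricted to the $k$ keys with the largest inner product, and as a Monte Carlo estimate of the expectation of a value vector under the softmax distribution. It is not the paper's algorithm. The top-$k$ search here is brute force rather than a $k$NN index, the sampler is ordinary Gumbel-max over all keys rather than lazy Gumbel sampling, and every function name is hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention_row(q, K, V):
    """Exact softmax attention output for one query vector q."""
    probs = softmax([dot(q, key) for key in K])
    return [sum(p * v[j] for p, v in zip(probs, V)) for j in range(len(V[0]))]


def knn_attention_row(q, K, V, k):
    """Attend only to the k keys with the largest inner product with q."""
    scores = [dot(q, key) for key in K]
    # Brute-force top-k (assumption: a real system would query a kNN index).
    top = sorted(range(len(K)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top])
    return [sum(p * V[i][j] for p, i in zip(probs, top)) for j in range(len(V[0]))]


def sampled_attention_row(q, K, V, num_samples, rng):
    """Monte Carlo estimate of E_{i ~ softmax(q . K)}[V_i].

    Uses the plain Gumbel-max trick over all keys, so each sample costs O(n);
    this is the non-lazy variant, shown only to make the expectation concrete.
    """
    scores = [dot(q, key) for key in K]
    acc = [0.0] * len(V[0])
    for _ in range(num_samples):
        noisy = []
        for s in scores:
            u = max(rng.random(), 1e-300)  # guard against log(0)
            noisy.append(s - math.log(-math.log(u)))  # score + Gumbel(0, 1)
        i = max(range(len(K)), key=lambda t: noisy[t])  # a softmax sample
        for j in range(len(acc)):
            acc[j] += V[i][j]
    return [a / num_samples for a in acc]


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 32, 4
    K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    q = [rng.uniform(-1, 1) for _ in range(d)]
    print("exact  :", full_attention_row(q, K, V))
    print("kNN k=8:", knn_attention_row(q, K, V, 8))
    print("sampled:", sampled_attention_row(q, K, V, 5000, rng))
```

With `k` equal to the number of keys, `knn_attention_row` reduces to exact attention. With smaller `k`, it drops the softmax mass on the remaining keys, which is the approximation error the paper sets out to bound.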

Code & Models

Repositories

sansui-123/knn_attention (PyTorch, official)

Videos

kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers · SlidesLive

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Advanced Data Storage Technologies

Methods: Attention Is All You Need · Softmax