Inference-time sparse attention with asymmetric indexing

Pierre-Emmanuel Mazaré; Gergely Szilvasy; Maria Lomeli; Francisco Massa; Naila Murray; Hervé Jégou; Matthijs Douze

arXiv:2502.08246 · cs.CL · June 4, 2025

TL;DR

This paper introduces Saap, an asymmetric indexing method for self-attention that improves efficiency by reducing memory lookups and computation time in large language models through data-adaptive sparsity patterns.

Contribution

Saap employs distinct partitions for keys and queries, overcoming limitations of standard partitioning methods and enabling efficient inference in large pretrained models.

Findings

01. Reduces memory lookups by a factor of 20 on large models

02. Achieves approximately 60% time savings compared to FlashAttention-v2

03. Effective on models with sequences up to 500k tokens

Abstract

Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compatible vector search algorithms based on standard partitioning methods such as k-means. However, such partitioning methods yield poor results in this context because (1) the keys and queries follow different distributions, and (2) the RoPE positional encoding hinders the bucket assignment. This paper introduces Saap (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetrical indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models and requires only the offline training of a small query classifier. On a long-context Llama 3.1-8b model, with sequences…
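The idea of using distinct partitions for keys and queries can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the nearest-centroid rule for keys, the linear query router and its weights, and the `n_probe` parameter are all assumptions made for illustration. Keys are bucketed by one rule, queries are routed to buckets by a separate scorer (standing in for the small trained query classifier), and attention is computed only over keys in the probed buckets.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign_keys(keys, centroids):
    """Key-side partition (assumed): each key goes to its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, k in enumerate(keys):
        dists = [sum((x - y) ** 2 for x, y in zip(k, c)) for c in centroids]
        buckets[dists.index(min(dists))].append(i)
    return buckets

def route_query(q, router_weights, n_probe):
    """Query-side partition (assumed): a separate linear scorer ranks the
    buckets; in practice its weights would be trained offline."""
    scores = [dot(q, w) for w in router_weights]
    order = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return order[:n_probe]

def sparse_attention(q, keys, values, buckets, probed):
    """Softmax attention restricted to the keys stored in the probed buckets."""
    idx = [i for b in probed for i in buckets[b]]
    dim = len(values[0])
    if not idx:
        return [0.0] * dim
    logits = [dot(q, keys[i]) / math.sqrt(len(q)) for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [sum(w[j] * values[i][d] for j, i in enumerate(idx)) / z
            for d in range(dim)]

# Toy data: two clusters of keys; the router weights are hypothetical and
# deliberately differ from the key centroids to show the asymmetry.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
router_weights = [[2.0, -1.0], [-1.0, 2.0]]

buckets = assign_keys(keys, centroids)
query = [3.0, 0.0]
probed = route_query(query, router_weights, n_probe=1)
out = sparse_attention(query, keys, values, buckets, probed)
```

With `n_probe` equal to the number of buckets, the sketch reduces to dense attention; smaller values trade accuracy for fewer key and value lookups, which is the kind of saving the abstract describes.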

Taxonomy

Topics: Blind Source Separation Techniques · Neural Networks and Reservoir Computing · Sparse and Compressive Sensing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · LLaMA