Inference-time sparse attention with asymmetric indexing
Pierre-Emmanuel Mazaré, Gergely Szilvasy, Maria Lomeli, Francisco Massa, Naila Murray, Hervé Jégou, Matthijs Douze

TL;DR
This paper introduces Saap, an asymmetric indexing method for self-attention that improves efficiency by reducing memory lookups and computation time in large language models through data-adaptive sparsity patterns.
Contribution
Saap employs distinct partitions for keys and queries, overcoming limitations of standard partitioning methods and enabling efficient inference in large pretrained models.
Findings
Reduces memory lookup by a factor of 20 on large models
Achieves approximately 60% time savings compared to FlashAttention-v2
Effective on models with sequences up to 500k tokens
Abstract
Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compatible vector search algorithms based on standard partitioning methods such as k-means. However, such partitioning methods yield poor results in this context because (1) the keys and queries follow different distributions, and (2) the RoPE positional encoding hinders the bucket assignment. This paper introduces Saap (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetric indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models and only requires training a small query classifier offline. On a long context Llama 3.1-8b model, with sequences…
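The idea of using one partition for keys and a separate assignment rule for queries can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the k-means key partition, the least-squares linear "query classifier", and the names `kmeans`, `fit_query_classifier`, `sparse_attention`, and `n_probe` are all assumptions made for illustration, and the sketch ignores RoPE, causal masking, batching, and GPU kernels.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def full_attention(q, K, V):
    # Dense baseline: the query attends to every key.
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

def kmeans(X, n_buckets, n_iter=10, seed=0):
    # Plain Lloyd iterations, used here only to partition the KEYS.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_buckets, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for b in range(n_buckets):
            if (assign == b).any():
                centroids[b] = X[assign == b].mean(0)
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def fit_query_classifier(Q_train, K, assign, n_buckets):
    # Hypothetical stand-in for the offline-trained query classifier:
    # a linear map predicting, for each query, how much dense-attention
    # mass falls into each key bucket.
    d = Q_train.shape[1]
    logits = Q_train @ K.T / np.sqrt(d)
    logits = logits - logits.max(1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(1, keepdims=True)
    targets = attn @ np.eye(n_buckets)[assign]        # (n_queries, n_buckets)
    W, *_ = np.linalg.lstsq(Q_train, targets, rcond=None)
    return W                                          # (d, n_buckets)

def sparse_attention(q, K, V, assign, W, n_probe):
    # Queries are routed by the classifier, not by nearest key centroid;
    # this is the asymmetric part of the sketch.
    probe = np.argsort(-(q @ W))[:n_probe]
    mask = np.isin(assign, probe)
    if not mask.any():                                # all probed buckets empty
        mask = np.ones(len(K), dtype=bool)
    out = softmax(K[mask] @ q / np.sqrt(q.shape[0])) @ V[mask]
    return out, int(mask.sum())
```

In this toy setup, `n_probe` controls the trade-off: probing every bucket reproduces dense attention exactly, while probing fewer buckets reads only a fraction of the keys and values and returns an approximation.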
Taxonomy
Topics: Blind Source Separation Techniques · Neural Networks and Reservoir Computing · Sparse and Compressive Sensing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · LLaMA
