Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Ali Hassani, Wen-Mei Hwu, Humphrey Shi

TL;DR
This paper introduces two kinds of optimized GPU kernels for neighborhood attention, GEMM-based and fused, that reduce runtime and memory use relative to existing naive kernels and make sliding window attention more practical in higher-rank (2-D and 3-D) spaces.
Contribution
The authors develop new batched GEMM-based kernels for 1-D and 2-D neighborhood attention and propose fused attention implementations to improve efficiency and runtime performance.
Findings
GEMM-based kernels improve runtime over existing naive kernels by an average of 895% (1-D) and 272% (2-D)
Fused neighborhood attention reduces memory footprint and enhances speed
Inherent inefficiencies in unfused implementations are mitigated by fusion techniques
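The memory finding can be made concrete with some rough arithmetic. An unfused implementation has to write the attention weights to memory before multiplying them by the values, whereas a fused kernel avoids materializing that intermediate tensor. The sketch below only estimates the size of that tensor. The batch size, head count, feature-map size, window size, and the helper name `attention_weights_bytes` are all hypothetical choices made for illustration, not figures from the paper.

```python
def attention_weights_bytes(batch, heads, tokens, span, bytes_per_element=2):
    """Size of a materialized attention-weights tensor of shape
    (batch, heads, tokens, span). bytes_per_element=2 assumes half precision."""
    return batch * heads * tokens * span * bytes_per_element


# Hypothetical problem: a 64 x 64 feature map, 8 heads, batch size 1.
tokens = 64 * 64

# Self attention: every token attends to every token.
self_attn = attention_weights_bytes(1, 8, tokens, tokens)

# 2-D neighborhood attention with an assumed 7 x 7 window: 49 neighbors per token.
neighborhood = attention_weights_bytes(1, 8, tokens, 7 * 7)

print(f"self attention weights:         {self_attn / 2**20:.2f} MiB")
print(f"neighborhood attention weights: {neighborhood / 2**20:.2f} MiB")
```

Under these assumptions the neighborhood weights are far smaller than the self attention weights, and a fused kernel would not need to store either tensor in global memory at all.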
Abstract
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood…
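As a minimal illustration of the attention pattern the abstract describes, the NumPy sketch below implements a naive 1-D neighborhood attention for a single head. It is a reference for the pattern only, not the paper's GEMM-based or fused kernels. The function names, and the convention of shifting the window at the sequence boundaries so that every token keeps exactly `window` neighbors, are assumptions made for this sketch.

```python
import numpy as np


def neighborhood_indices(n, window, dilation=1):
    """Return an (n, window) array of the key indices each query attends to.

    Assumed convention: with dilation d, token i only sees tokens j with
    j % d == i % d, and the window is shifted (not padded) at the edges.
    """
    idx = np.empty((n, window), dtype=np.int64)
    for i in range(n):
        r = i % dilation                        # dilation group of token i
        p = i // dilation                       # position of token i inside its group
        m = (n - r + dilation - 1) // dilation  # number of tokens in that group
        if m < window:
            raise ValueError("window * dilation is too large for this sequence length")
        start = min(max(p - window // 2, 0), m - window)
        idx[i] = r + dilation * (start + np.arange(window))
    return idx


def neighborhood_attention_1d(q, k, v, window, dilation=1):
    """Naive single-head 1-D neighborhood attention; q, k, v have shape (n, d)."""
    n, d = q.shape
    idx = neighborhood_indices(n, window, dilation)
    # Scores against each token's neighbors only: shape (n, window).
    scores = np.einsum("nd,nwd->nw", q, k[idx]) / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("nw,nwd->nd", weights, v[idx])


def self_attention(q, k, v):
    """Standard single-head self attention, used here only for comparison."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = neighborhood_attention_1d(q, k, v, window=3, dilation=2)
```

The two ends of the spectrum mentioned in the abstract fall out of this sketch: with `window=1` each token attends only to itself, so the output is just `v` (in a full layer, a linear projection of the input), and with `window` equal to the sequence length the result matches `self_attention`.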
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques
Methods: Neighborhood Attention
