Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

Ali Hassani; Wen-Mei Hwu; Humphrey Shi

arXiv:2403.04690 · cs.CV · November 1, 2024 · 1 citation

TL;DR

This paper introduces optimized GPU kernels for neighborhood attention, significantly reducing computational costs and memory usage, enabling faster and more scalable attention mechanisms for high-dimensional data.

Contribution

The authors develop new batched GEMM-based kernels for 1-D and 2-D neighborhood attention and propose fused attention implementations to improve efficiency and runtime performance.

Findings

01. Batched GEMM-based kernels improve full-precision runtime by 895% (1-D) and 272% (2-D) on average over existing naive kernels

02. Fused neighborhood attention reduces memory footprint and enhances speed

03. Inherent inefficiencies in unfused implementations are mitigated by fusion techniques

Abstract

Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood…
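To make the window size and dilation parameters concrete, the following is a minimal pure-Python reference sketch of 1-D neighborhood attention. It is not the paper's GEMM-based or fused kernels, and the function names `neighborhood_indices` and `neighborhood_attention_1d` are hypothetical. It assumes an odd window size, a sequence at least `window * dilation` tokens long, and that windows are shifted inward at the sequence boundaries so every token attends to exactly `window` tokens.

```python
import math


def neighborhood_indices(i, n, window, dilation=1):
    """Indices token i attends to in a length-n sequence (hypothetical helper).

    Assumption: tokens sharing the same i % dilation form one group, and the
    window is clamped inside that group so it always holds `window` tokens.
    """
    assert window % 2 == 1 and window * dilation <= n
    group = i % dilation
    group_len = (n - group + dilation - 1) // dilation  # tokens in this group
    pos = i // dilation                                 # position of i in its group
    start = min(max(pos - window // 2, 0), group_len - window)
    return [group + dilation * (start + j) for j in range(window)]


def neighborhood_attention_1d(q, k, v, window, dilation=1):
    """Naive single-head reference: softmax(q_i . k_j / sqrt(d)) over neighbors j."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        idx = neighborhood_indices(i, n, window, dilation)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        m = max(scores)                                 # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, idx)) / total
            for c in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    print(neighborhood_indices(4, 8, 3))              # [3, 4, 5]
    print(neighborhood_indices(7, 8, 3))              # [5, 6, 7] (clamped at the edge)
    print(neighborhood_indices(7, 8, 3, dilation=2))  # [3, 5, 7]
```

In this sketch, a window of 1 returns each token's own value vector, and a window equal to the sequence length (with dilation 1) makes every token attend to the whole sequence, which matches the two ends of the spectrum the abstract describes. The cost per token is proportional to the window size rather than the sequence length.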

Code & Models

Repositories

shi-labs/natten · PyTorch · Official

Videos

Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level · SlidesLive

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques

Methods: Neighborhood Attention