Neighborhood Attention Transformer
Ali Hassani, Steven Walton, Jiachen Li, Shen Li, Humphrey Shi

TL;DR
This paper introduces Neighborhood Attention, an efficient local attention mechanism for vision transformers, and demonstrates its advantages in speed, memory, and performance over existing methods like Swin Transformer.
Contribution
The paper proposes Neighborhood Attention, a scalable sliding-window attention mechanism, and develops a hierarchical transformer architecture that outperforms comparable models on key vision benchmarks.
Findings
Neighborhood Attention is faster and uses less memory than Swin's window self-attention.
NAT achieves competitive accuracy on ImageNet, COCO, and ADE20K datasets.
An open-source implementation and pretrained checkpoints are provided for further research.
Abstract
We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance.…
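To make the abstract's description concrete, the sketch below shows a naive, single-head version of 2D neighborhood attention in pure Python. Each query pixel attends only to a kernel × kernel window of nearby keys, so the cost is roughly H·W·k²·d rather than the (H·W)²·d of global self-attention. This is not NATTEN or the authors' code. It rests on several stated assumptions. Queries, keys, and values are passed in directly, with no learned projections, multiple heads, or relative positional biases. At image borders the window is shifted inward so that every pixel still sees exactly k × k neighbors; this is our reading of NA's edge handling. The function names `neighborhood_start` and `neighborhood_attention` are hypothetical.

```python
import math


def neighborhood_start(idx: int, length: int, kernel: int) -> int:
    """First index of a kernel-sized window centered on idx.

    Assumption: near the border the window is shifted (clamped) to stay
    inside [0, length) instead of being zero-padded.
    """
    half = kernel // 2
    return min(max(idx - half, 0), length - kernel)


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_attention(q, k, v, kernel):
    """Naive single-head 2D neighborhood attention (illustrative sketch).

    q, k, v: H x W grids of d-dimensional feature vectors (lists of floats).
    Each query pixel attends only to a kernel x kernel window of keys/values.
    """
    height, width = len(q), len(q[0])
    if kernel % 2 == 0 or kernel > height or kernel > width:
        raise ValueError("kernel must be odd and no larger than the grid")
    dim = len(q[0][0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(height):
        r0 = neighborhood_start(i, height, kernel)
        row = []
        for j in range(width):
            c0 = neighborhood_start(j, width, kernel)
            # The kernel * kernel neighbor positions for query pixel (i, j).
            coords = [(r, c) for r in range(r0, r0 + kernel)
                      for c in range(c0, c0 + kernel)]
            # Scaled dot-product scores against the neighbors only.
            scores = [scale * sum(a * b for a, b in zip(q[i][j], k[r][c]))
                      for r, c in coords]
            weights = softmax(scores)
            # Weighted sum of the neighbors' value vectors.
            row.append([sum(w * v[r][c][t] for w, (r, c) in zip(weights, coords))
                        for t in range(dim)])
        out.append(row)
    return out
```

With `kernel` equal to both grid dimensions, every query sees every key and the function reduces to ordinary global self-attention. With a fixed small `kernel`, the work grows linearly with the number of pixels, which is the complexity advantage the abstract describes.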
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Neighborhood Attention
