The Linear Attention Resurrection in Vision Transformer
Chuanyang Zheng

TL;DR
This paper introduces L²ViT, a novel vision transformer architecture that combines linear attention with local window attention, enabling efficient high-resolution image processing while maintaining strong global and local feature representation.
Contribution
The paper proposes a new linear attention method with a local concentration module, creating L²ViT, which captures global and local features efficiently without sacrificing performance.
Findings
L²ViT achieves 84.4% Top-1 accuracy on ImageNet-1K.
Pre-training on ImageNet-22k boosts accuracy to 87.0%.
L²ViT performs well on object detection and semantic segmentation.
Abstract
Vision Transformers (ViTs) have recently taken computer vision by storm. However, the softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images. We revisit the attention design and propose a linear attention method to address the limitation. Unlike existing methods (e.g., the local window attention of Swin), it doesn't sacrifice ViT's core advantage of capturing global representations. We further investigate the key difference between linear attention and softmax attention. Our empirical results suggest that linear attention lacks a fundamental property of softmax attention: concentrating the distribution of the attention matrix. Inspired by this observation, we introduce a local concentration module to enhance linear attention. By incorporating enhanced linear global attention and local window attention, we propose a new…
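To make the complexity argument in the abstract concrete, here is a minimal NumPy sketch contrasting softmax attention with a generic kernelized linear attention. It is an illustration under stated assumptions, not the paper's implementation: the ReLU-based `feature_map`, the function names, and the token and channel sizes are all hypothetical choices, and the local concentration module is not modeled. The point is only that reordering the matrix products avoids building the N×N attention matrix.

```python
import numpy as np

def feature_map(x):
    # Assumed non-negative feature map (ReLU plus a small constant).
    # The paper's actual kernel is not given in this summary.
    return np.maximum(x, 0.0) + 1e-6

def softmax_attention(q, k, v):
    # Standard attention: materializes an (N, N) matrix, so cost grows as O(N^2 * d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Kernelized attention computed as phi(Q) @ (phi(K)^T @ V).
    # The (d, d_v) summary `kv` is built once, so cost grows as O(N * d * d_v).
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                 # (d, d_v)
    z = q @ k.sum(axis=0)        # (N,) per-query normalizer
    return (q @ kv) / z[:, None]

def linear_attention_quadratic(q, k, v):
    # Same quantity computed the slow way, with an explicit (N, N) matrix,
    # used here only to check the reordering above.
    q, k = feature_map(q), feature_map(k)
    weights = q @ k.T
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 196, 32  # e.g. 14x14 tokens with 32 channels (illustrative sizes)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_fast = linear_attention(q, k, v)
out_ref = linear_attention_quadratic(q, k, v)
out_softmax = softmax_attention(q, k, v)
```

Both linear variants produce the same output up to floating-point error; only the order of multiplication differs. The implied attention weights in the linear case are normalized products of feature maps rather than exponentials, which is one way to see why they can be less concentrated than softmax weights.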
Taxonomy
Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors · Neural Networks and Applications
Methods: Attention Is All You Need · Softmax
