Bridging the Divide: Reconsidering Softmax and Linear Attention
Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, Gao Huang

TL;DR
This paper analyzes the core differences between Softmax and linear attention in Vision Transformers, providing theoretical insights and practical methods to enhance linear attention's performance and scalability.
Contribution
It introduces theoretical analyses of linear attention's limitations and proposes ways to improve its effectiveness, bridging the gap with Softmax attention.
Findings
Linear attention is not injective, leading to semantic confusion.
Local modeling is crucial for Softmax attention's success.
Enhanced linear attention can outperform Softmax in various tasks.
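
One way to see the non-injectivity finding concretely is the small sketch below. It assumes a ReLU feature map for linear attention, which this summary does not specify, and the keys and queries are made-up toy values. Under that assumption, two queries that point in the same direction but differ in length receive identical linear-attention weights, while Softmax attention tells them apart. It is an illustration of the idea, not the paper's proof.

```python
import math


def relu(x):
    # Assumed feature map (hypothetical choice; the summary names no kernel).
    return [max(0.0, v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_weights(q, keys):
    # Softmax attention weights of one query over all keys.
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_weights(q, keys, phi=relu):
    # Linear attention weights: phi(q).phi(k_j) normalized over all keys.
    scores = [dot(phi(q), phi(k)) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]


# Toy data (illustrative only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q1 = [1.0, 2.0]
q2 = [3.0, 6.0]  # same direction as q1, three times the norm

lin1, lin2 = linear_weights(q1, keys), linear_weights(q2, keys)
soft1, soft2 = softmax_weights(q1, keys), softmax_weights(q2, keys)

# Two different queries collapse to the same linear-attention weights.
linear_collides = all(math.isclose(a, b) for a, b in zip(lin1, lin2))
# Softmax attention assigns them clearly different weights.
softmax_gap = max(abs(a - b) for a, b in zip(soft1, soft2))

print("linear weights identical:", linear_collides)
print("largest softmax weight difference:", round(softmax_gap, 3))
```

The collision follows from ReLU being positively homogeneous: scaling the query by a positive constant scales every score by the same constant, and the normalization cancels it. Other feature maps may fail to be injective in different ways.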
Abstract
Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between the linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective, which…
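
The linear-complexity claim in the abstract can be illustrated with a short sketch. It assumes the ELU+1 feature map, a common choice in the linear attention literature that is not necessarily the one used in this paper, and all function names are hypothetical. The same output is computed two ways: by forming the full N x N weight matrix, which costs O(N^2 d), and by first aggregating keys and values into a d x d summary, which costs O(N d^2).

```python
import math
import random


def phi(x):
    # Assumed feature map: ELU(x) + 1, which is always positive.
    return x + 1.0 if x > 0 else math.exp(x)


def apply_phi(m):
    return [[phi(v) for v in row] for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def linear_attention_quadratic(q, k, v):
    # Builds the explicit N x N weight matrix: O(N^2 d) time.
    fq, fk = apply_phi(q), apply_phi(k)
    scores = matmul(fq, transpose(fk))
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def linear_attention_linear(q, k, v):
    # Reorders the product as phi(Q) (phi(K)^T V): O(N d^2) time.
    fq, fk = apply_phi(q), apply_phi(k)
    kv = matmul(transpose(fk), v)           # d x d summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]  # length-d normalizer summary
    numer = matmul(fq, kv)
    denom = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(numer, denom)]


random.seed(0)
n, d = 6, 4  # toy sizes; the saving matters when n is much larger than d
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

out_quadratic = linear_attention_quadratic(q, k, v)
out_linear = linear_attention_linear(q, k, v)
max_diff = max(
    abs(x - y)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for x, y in zip(row_a, row_b)
)
print("outputs agree:", max_diff < 1e-9)
```

The reordering works only because the feature map replaces the Softmax exponential, so the score matrix factorizes as a product of per-token features. Softmax attention has no such factorization and must form all N x N scores.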
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing · Layer Normalization
