Bridging the Divide: Reconsidering Softmax and Linear Attention

Dongchen Han; Yifan Pu; Zhuofan Xia; Yizeng Han; Xuran Pan; Xiu Li; Jiwen Lu; Shiji Song; Gao Huang

arXiv:2412.06590 · cs.CV · December 10, 2024

TL;DR

This paper analyzes the core differences between Softmax and linear attention in Vision Transformers, providing theoretical insights and practical methods to enhance linear attention's performance and scalability.

Contribution

The paper analyzes linear attention's limitations from two perspectives, the injective property and the local modeling ability, and proposes ways to improve its effectiveness, narrowing the gap with Softmax attention.

Findings

01 Linear attention is not injective, leading to semantic confusion.

02 Local modeling is crucial for Softmax attention's success.

03 Enhanced linear attention can outperform Softmax in various tasks.
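
The first finding can be illustrated with a small sketch. It assumes a ReLU feature map, which is one common choice and not necessarily the kernel the paper analyzes; the helper names (`softmax_weights`, `linear_weights`) are hypothetical and are not taken from the official repository. Because ReLU is positively homogeneous, scaling a query leaves the normalized linear attention weights unchanged, so two different queries can receive the same attention distribution, whereas Softmax attention tells them apart.

```python
import math


def relu(v):
    """Element-wise ReLU, used here as an assumed feature map phi."""
    return [max(0.0, x) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_weights(q, keys):
    """Softmax attention weights of one query over all keys."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_weights(q, keys, phi=relu):
    """Linear attention weights: phi(q).phi(k_i) / sum_j phi(q).phi(k_j).

    Assumes at least one score is positive so the normalizer is nonzero.
    """
    scores = [dot(phi(q), phi(k)) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q1 = [1.0, 2.0]
q2 = [2.0, 4.0]  # a different query: q2 = 2 * q1

lin1, lin2 = linear_weights(q1, keys), linear_weights(q2, keys)
soft1, soft2 = softmax_weights(q1, keys), softmax_weights(q2, keys)

# Linear attention cannot distinguish q1 from q2; Softmax attention can.
same_linear = all(math.isclose(a, b) for a, b in zip(lin1, lin2))
same_softmax = all(math.isclose(a, b) for a, b in zip(soft1, soft2))
print("linear weights identical:", same_linear)
print("softmax weights identical:", same_softmax)
```

This is only one way a non-injective mapping can arise under the stated assumption; it is not a reproduction of the paper's proof.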

Abstract

Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between the linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective, which…
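
The linear complexity mentioned in the abstract comes from reordering the matrix products. The sketch below is a minimal pure-Python illustration, not the authors' implementation: it assumes the feature map `elu(x) + 1` (chosen here only because it is strictly positive) and uses hypothetical function names. It computes kernel-based attention twice, once through the explicit N x N weight matrix and once by first aggregating keys and values into a d x d_v summary, and checks that both give the same output.

```python
import math
import random


def phi(x):
    """Assumed feature map elu(x) + 1; strictly positive for every x."""
    return x + 1.0 if x > 0 else math.exp(x)


def feature_map(m):
    return [[phi(x) for x in row] for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def attention_quadratic(q, k, v):
    """Builds the full N x N weight matrix: cost grows as N^2."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))                 # N x N
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)                          # N x d_v


def attention_linear(q, k, v):
    """Aggregates keys and values first: cost grows as N."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)                      # d x d_v summary
    z = [sum(col) for col in zip(*fk)]                 # d-dim normalizer
    out = []
    for row in fq:
        denom = sum(a * b for a, b in zip(row, z))
        numer = matmul([row], kv)[0]
        out.append([x / denom for x in numer])
    return out


random.seed(0)
n, d = 6, 4  # toy sizes; the reordering pays off when n is much larger than d
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

out_quadratic = attention_quadratic(q, k, v)
out_linear = attention_linear(q, k, v)
max_diff = max(abs(a - b)
               for ra, rb in zip(out_quadratic, out_linear)
               for a, b in zip(ra, rb))
print("max difference between the two orderings:", max_diff)
```

Softmax attention has no such factorization, because the exponential is applied to each query-key score before normalization, which is why its cost stays quadratic in the number of tokens.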

Code & Models

Repositories

leaplabthu/inline (PyTorch, official)

Videos

Bridging the Divide: Reconsidering Softmax and Linear Attention · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing · Layer Normalization