Softmax-free Linear Transformers
Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, Li Zhang

TL;DR
This paper introduces Softmax-Free Transformers (SOFT), a novel approach that replaces softmax-based self-attention with a Gaussian kernel, enabling linear complexity and improved efficiency for vision transformers.
Contribution
The paper proposes a new family of Softmax-Free Transformers using Gaussian kernels and low-rank approximation, addressing limitations of existing methods and enhancing efficiency for visual recognition tasks.
Findings
Significant computational efficiency improvements on ImageNet, COCO, and ADE20K.
Enables processing of much longer token sequences with better accuracy-efficiency trade-offs.
Achieves linear complexity in self-attention, outperforming softmax-based methods.
Abstract
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of methods that approximate self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. Preserving the softmax operation challenges any subsequent linearization efforts. Based on this insight, a family of Softmax-Free Transformers (SOFT) is proposed. Specifically, a Gaussian kernel function is adopted to…
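To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two similarity functions it mentions: softmax over the scaled dot-product, and a Gaussian kernel over token distances. The function names, the toy input, and the kernel bandwidth 2*sqrt(d) are assumptions made for illustration, not details taken from the text above.

```python
import numpy as np


def softmax_attention(q, k):
    """Scaled dot-product scores normalized row-wise with softmax; shape (n, n)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def gaussian_kernel_attention(q, k):
    """Softmax-free similarity exp(-||q_i - k_j||^2 / (2 * sqrt(d))); shape (n, n).

    The bandwidth 2 * sqrt(d) is an assumption for this sketch.
    """
    d = q.shape[-1]
    # Pairwise squared Euclidean distances via ||q||^2 + ||k||^2 - 2 q.k
    sq_dist = (
        (q ** 2).sum(axis=-1)[:, None]
        + (k ** 2).sum(axis=-1)[None, :]
        - 2.0 * q @ k.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives from rounding
    return np.exp(-sq_dist / (2.0 * np.sqrt(d)))


# Toy input: 6 tokens with 8 features, used as both queries and keys.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
soft = softmax_attention(x, x)
gauss = gaussian_kernel_attention(x, x)
```

When queries and keys are the same tokens, the Gaussian kernel matrix is symmetric with ones on the diagonal and needs no row normalization, whereas each softmax row depends on every score in that row. This sketch still builds the full n-by-n matrix, so it is quadratic in the number of tokens; the low-rank approximation that the summary credits for linear complexity is not shown here.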
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding
