Softmax-free Linear Transformers

Jiachen Lu; Junge Zhang; Xiatian Zhu; Jianfeng Feng; Tao Xiang; Li Zhang

arXiv:2207.03341·cs.CV·March 18, 2024·1 cites

TL;DR

This paper introduces Softmax-Free Transformers (SOFT), a novel approach that replaces softmax-based self-attention with a Gaussian kernel, enabling linear complexity and improved efficiency for vision transformers.

Contribution

The paper proposes a new family of Softmax-Free Transformers using Gaussian kernels and low-rank approximation, addressing limitations of existing methods and enhancing efficiency for visual recognition tasks.
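The combination of a Gaussian kernel with a low-rank approximation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the page does not name the approximation scheme, so a Nyström-style factorization with evenly spaced landmark tokens is assumed here, and the function names, the `scale` parameter, and the use of `np.linalg.pinv` are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel(x, y, scale):
    """Pairwise Gaussian kernel exp(-||x_i - y_j||^2 / (2 * scale))."""
    sq = (x**2).sum(1)[:, None] + (y**2).sum(1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * scale))

def lowrank_gaussian_attention(x, v, m, scale):
    """Approximate K @ v for K = gaussian_kernel(x, x) without forming K.

    Assumption: m evenly spaced tokens serve as landmarks (Nystrom-style).
    """
    n = x.shape[0]
    idx = np.round(np.linspace(0, n - 1, m)).astype(int)
    landmarks = x[idx]
    c = gaussian_kernel(x, landmarks, scale)          # (n, m)
    w = gaussian_kernel(landmarks, landmarks, scale)  # (m, m)
    # Multiply right to left so no n x n matrix is ever materialized.
    return c @ (np.linalg.pinv(w) @ (c.T @ v))
```

With `m` fixed, the cost grows linearly in the number of tokens `n`, whereas forming the full kernel matrix grows quadratically. When `m` equals `n`, the factorization reproduces the exact product up to numerical error.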

Findings

01 Significant computational efficiency improvements on ImageNet, COCO, and ADE20K.

02 Enables processing of much longer token sequences with better accuracy-efficiency trade-offs.

03 Achieves linear complexity in self-attention, outperforming softmax-based methods.

Abstract

Vision transformers (ViTs) have pushed the state of the art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has quadratic complexity in both computation and memory usage. This motivates the development of methods that approximate self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximation, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. Preserving the softmax operation challenges any subsequent linearization effort. Based on this insight, a family of Softmax-Free Transformers (SOFT) is proposed. Specifically, a Gaussian kernel function is adopted to…
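The contrast between softmax-normalized dot-product scores and a Gaussian kernel can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions rather than the paper's exact formulation: the function names are hypothetical, and the `2 * sqrt(d)` bandwidth in the Gaussian variant is assumed by analogy with the usual `sqrt(d)` scaling of dot-product attention.

```python
import numpy as np

def softmax_attention_scores(q, k):
    """Scaled dot-product scores, normalized over each row with softmax."""
    d = q.shape[1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    # Every entry depends on its whole row through this normalization.
    return w / w.sum(axis=1, keepdims=True)

def gaussian_attention_scores(q, k):
    """Softmax-free scores: a Gaussian kernel on squared distances.

    Assumption: bandwidth 2 * sqrt(d), chosen for illustration only.
    """
    d = q.shape[1]
    sq = (q**2).sum(1)[:, None] + (k**2).sum(1)[None, :] - 2.0 * q @ k.T
    # Each entry depends only on its own query-key pair.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * np.sqrt(d)))
```

The softmax version couples every score to all other scores in its row, which is the normalization step the abstract identifies as an obstacle to linearization. The Gaussian version is computed entry by entry, and when queries and keys coincide it yields a symmetric matrix with ones on the diagonal.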

Code & Models

Repositories

fudan-zvg/soft (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding