Breaking the Low-Rank Dilemma of Linear Attention
Qihang Fan, Huaibo Huang, Ran He

TL;DR
This paper introduces Rank-Augmented Linear Attention (RALA) to overcome the low-rank limitations of linear attention, achieving performance comparable to Softmax attention in vision tasks while maintaining linear complexity.
Contribution
The paper proposes RALA, a novel linear attention method that addresses the low-rank issue, and constructs RAVLT, a vision transformer that outperforms previous linear attention models.
Findings
RAVLT achieves 84.4% Top-1 accuracy on ImageNet-1k.
RALA rivals Softmax attention performance with linear complexity.
The approach significantly surpasses previous linear attention mechanisms.
Abstract
The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while…
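The complexity and rank claims in the abstract can be illustrated with a small NumPy sketch. This is not the paper's RALA. It assumes a generic kernelized linear attention with an `elu(x) + 1` feature map, which the abstract does not specify, and all function names are hypothetical. It shows why the cost is linear in the number of tokens and why the implicit attention matrix has rank at most the head dimension.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard attention: materializes an N x N weight matrix, O(N^2 * d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def feature_map(x):
    """elu(x) + 1: a positive kernel feature map (an assumption, not from the paper)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Kernelized attention: aggregates keys and values first, O(N * d^2)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv_buffer = phi_k.T @ v            # d x d summary of all keys and values
    normalizer = phi_q @ phi_k.sum(axis=0)  # one scalar per query
    return (phi_q @ kv_buffer) / normalizer[:, None], kv_buffer


def implicit_linear_weights(q, k):
    """The N x N matrix that linear attention applies without ever forming it."""
    weights = feature_map(q) @ feature_map(k).T
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n_tokens, head_dim = 64, 8
q, k, v = rng.standard_normal((3, n_tokens, head_dim))

soft_out, soft_weights = softmax_attention(q, k, v)
lin_out, kv_buffer = linear_attention(q, k, v)
lin_weights = implicit_linear_weights(q, k)

rank_softmax = np.linalg.matrix_rank(soft_weights)
rank_linear = np.linalg.matrix_rank(lin_weights)
rank_kv = np.linalg.matrix_rank(kv_buffer)

print("softmax attention matrix rank:", rank_softmax)
print("linear attention matrix rank: ", rank_linear)
print("KV buffer shape and rank:     ", kv_buffer.shape, rank_kv)
```

Because the implicit linear attention matrix is the product of an N x d matrix and a d x N matrix, its rank cannot exceed d however many tokens there are. The softmax matrix has no such bound, since the exponential is applied elementwise to the scores.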
Taxonomy
Topics: Cognitive Science and Education Research
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection
