Rectifying Magnitude Neglect in Linear Attention
Qihang Fan, Huaibo Huang, Yuang Ai, Ran He

TL;DR
This paper identifies Linear Attention's neglect of Query magnitude as a key cause of its performance gap relative to Softmax Attention, and proposes Magnitude-Aware Linear Attention (MALA) to improve performance across diverse vision and language tasks.
Contribution
It introduces MALA, a novel modification to Linear Attention that incorporates Query magnitude, aligning its attention distribution more closely with Softmax Attention.
Findings
MALA significantly improves accuracy in image classification and object detection.
MALA achieves state-of-the-art results across multiple vision and NLP tasks.
The gains appear in both vision and language tasks, which suggests the approach is not tied to a single domain.
Abstract
As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly…
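The magnitude argument in the abstract can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a ReLU kernel feature map (the hypothetical `phi` below), and the same cancellation holds for any positively homogeneous feature map. Scaling the Query by a positive factor leaves the normalized Linear Attention scores unchanged, while the Softmax Attention scores become more peaked. All function names and the toy vectors are illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def scale(v, s):
    """Multiply every entry of v by the scalar s."""
    return [s * x for x in v]


def phi(v):
    """Assumed kernel feature map (ReLU); an illustrative choice only."""
    return [max(x, 0.0) for x in v]


def softmax_scores(q, keys):
    """Softmax attention weights of one query over all keys."""
    logits = [dot(q, k) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(q, keys):
    """Normalized linear attention weights: phi(q).phi(k_j) / sum_m phi(q).phi(k_m)."""
    fq = phi(q)
    sims = [dot(fq, phi(k)) for k in keys]
    total = sum(sims)  # assumed non-zero for the toy inputs used here
    return [s / total for s in sims]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for s in (1.0, 4.0):
        qs = scale(q, s)
        print("scale", s)
        print("  softmax:", softmax_scores(qs, keys))
        print("  linear: ", linear_scores(qs, keys))
```

In the linear case the scale factor appears in both the numerator and the denominator and cancels, so only the direction of the Query affects the scores. This is the behavior the abstract describes as disregarding the Query's magnitude.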
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
