Why Softmax Attention Outperforms Linear Attention

Yichuan Deng; Zhao Song; Kaijun Yuan; Tianyi Zhou

arXiv:2310.11685 · cs.CL · March 16, 2026 · 1 citation

Open Access

TL;DR

This paper investigates why softmax attention generally outperforms linear attention in transformer models, providing a theoretical analysis to explain the performance gap and guiding future improvements in attention mechanisms.

Contribution

The paper offers a comprehensive theoretical comparison between softmax and linear attention, clarifying the reasons for softmax's superior performance.

Findings

01 Softmax attention captures token interactions more effectively.

02 Linear attention's approximation leads to performance degradation.

03 Theoretical insights explain the practical performance gap.

Abstract

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the use of the softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons for why softmax attention outperforms…
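
To make the two mechanisms named in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this page does not reproduce. It assumes standard scaled dot-product attention for the softmax case and, for the linear case, a kernelized variant with the feature map elu(x) + 1, which is one common choice among several. The helper names (`matmul`, `transpose`, `feature_map`) and the toy inputs are hypothetical and exist only for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_attention(q, k, v):
    """Scaled dot-product attention; materializes the full n x n score matrix."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # n x n, hence quadratic cost in n
    weights = []
    for row in scores:
        scaled = [s / math.sqrt(d) for s in row]
        m = max(scaled)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def feature_map(x):
    """elu(x) + 1, a strictly positive feature map (an assumed choice)."""
    return [[xi + 1.0 if xi > 0 else math.exp(xi) for xi in row] for row in x]

def linear_attention(q, k, v):
    """Kernelized attention: phi(Q) (phi(K)^T V) with per-row normalization."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary, built in one pass over n
    k_sum = [sum(col) for col in zip(*fk)]   # length-d normalizer
    numer = matmul(fq, kv)
    out = []
    for fq_row, num_row in zip(fq, numer):
        z = sum(a * b for a, b in zip(fq_row, k_sum))  # positive, since phi > 0
        out.append([x / z for x in num_row])
    return out

# Toy inputs: 3 tokens, dimension 2 (values chosen arbitrarily).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.5], [-0.5, 1.0], [0.0, -1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(softmax_attention(q, k, v))
print(linear_attention(q, k, v))
```

In this sketch the softmax version builds an n × n weight matrix, while the linear version never does: it reuses a d × d_v summary of the keys and values, which is where the linear complexity in sequence length comes from. Both produce rows that are weighted averages of the value vectors, but with different weights, so their outputs generally differ.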

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Neural Networks and Applications

Methods: Softmax