On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Gabriel Mongaras; Eric C. Larson

arXiv:2507.23632 · cs.LG · February 20, 2026

TL;DR

This paper derives a recurrent form of softmax attention and uses it to examine why softmax attention is more expressive than linear attention methods.

Contribution

The paper introduces a recurrent neural network (RNN) view of softmax attention and uses it to explain why softmax attention is more expressive than linear attention methods.

Findings

01. Softmax attention can be represented as a recurrent neural network.

02. Ablation of components reveals their roles in attention expressiveness.

03. Softmax attention's nonlinearity contributes to its higher accuracy.

Abstract

Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is the quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid the quadratic bottleneck of softmax attention. Despite these linear forms of attention being derived from the original softmax formulation, they typically lag in terms of downstream accuracy. While strong intuition of the softmax nonlinearity on the query and key inner product suggests that it has desirable properties compared to other nonlinearities, the question of why this discrepancy exists still remains unanswered. This work demonstrates that linear attention…
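
The contrast the abstract draws between quadratic softmax attention and linear attention can be made concrete with a small sketch. This is illustrative only and is not the paper's derivation. The function names are hypothetical, the feature map `phi` (elu(x) + 1) is an assumed choice borrowed from common linear attention formulations, and the usual 1/sqrt(d) scaling is omitted for brevity. The sketch shows that causal linear attention can be computed either as a quadratic sum over past positions or as a recurrence with a fixed-size state, whereas the softmax version as written revisits every past key at each step.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_softmax_attention(Q, K, V):
    """Causal softmax attention; cost grows quadratically with sequence length."""
    d_v = len(V[0])
    out = []
    for t, q in enumerate(Q):
        scores = [dot(q, K[s]) for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(x - m) for x in scores]
        total = sum(w)
        out.append([sum(w[s] * V[s][j] for s in range(t + 1)) / total
                    for j in range(d_v)])
    return out


def phi(x):
    """Assumed feature map elu(x) + 1, which keeps every feature positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def causal_kernel_attention(Q, K, V):
    """Quadratic reference: weights phi(q_t) . phi(k_s) replace exp(q_t . k_s)."""
    d_v = len(V[0])
    out = []
    for t, q in enumerate(Q):
        fq = phi(q)
        w = [dot(fq, phi(K[s])) for s in range(t + 1)]
        total = sum(w)
        out.append([sum(w[s] * V[s][j] for s in range(t + 1)) / total
                    for j in range(d_v)])
    return out


def causal_linear_attention_rnn(Q, K, V):
    """Same result as causal_kernel_attention, computed as a recurrence.

    The state is a d_k x d_v matrix S and a d_k vector z, so the memory
    used does not grow with the sequence length.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]                 # running sum of phi(k_s)
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]   # running sum of phi(k_s) v_s^T
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out
```

For example, calling `causal_linear_attention_rnn(Q, K, V)` and `causal_kernel_attention(Q, K, V)` on the same small lists of query, key, and value vectors should agree up to floating-point error, while `causal_softmax_attention(Q, K, V)` generally gives different outputs because its weights come from the exponential rather than from `phi`.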

Taxonomy

Topics: Complex Systems and Time Series Analysis · Advanced Thermodynamics and Statistical Mechanics · Neural Networks and Applications