On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Gabriel Mongaras, Eric C. Larson

TL;DR
This paper derives a recurrent form of softmax attention and uses it to explain why softmax attention is more accurate than linear attention methods.
Contribution
The paper introduces a recurrent neural network (RNN) perspective on softmax attention that explains why it is more expressive than linear attention methods.
Findings
Softmax attention can be represented as a recurrent neural network.
Ablating individual components of softmax attention shows what each contributes to its expressiveness.
Softmax attention's nonlinearity contributes to its higher accuracy.
Abstract
Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is its quadratic memory requirement and computational complexity with respect to the sequence length. Linear attention and similar methods replace the softmax nonlinearity to avoid this quadratic bottleneck. Although these linear forms of attention are derived from the original softmax formulation, they typically lag in downstream accuracy. Intuition about the softmax nonlinearity applied to the query-key inner product suggests that it has desirable properties compared to other nonlinearities, but the question of why this discrepancy exists remains unanswered. This work demonstrates that linear attention…
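The trade-off the abstract describes can be illustrated with a small sketch. This is not the paper's derivation (the abstract above is truncated before that point); it is the standard kernelized linear attention construction, shown only to make the "quadratic versus recurrent" contrast concrete. The function names, the `elu(x) + 1` feature map `phi`, and the toy inputs are all assumptions chosen for illustration. Causal softmax attention rescans every earlier key for each token, whereas causal linear attention can be computed either the same quadratic way or as a recurrence over a fixed-size state, and the two linear forms agree.

```python
import math

def softmax_attention_causal(Q, K, V):
    """Causal softmax attention: token t rescans keys 0..t (quadratic in length)."""
    d = len(Q[0])
    out = []
    for t in range(len(Q)):
        scores = [sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in scores]
        total = sum(weights)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights)) / total
                    for j in range(len(V[0]))])
    return out

def phi(x):
    """Assumed positive feature map, elu(x) + 1, applied elementwise."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention_quadratic(Q, K, V):
    """Causal linear attention written like softmax attention, with
    exp(q . k) replaced by phi(q) . phi(k). Still quadratic in length."""
    out = []
    for t in range(len(Q)):
        fq = phi(Q[t])
        weights = [sum(a * b for a, b in zip(fq, phi(K[s]))) for s in range(t + 1)]
        total = sum(weights)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention_recurrent(Q, K, V):
    """The same causal linear attention as a recurrence: the state (S, z)
    has a fixed size, so the cost per token does not grow with length."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = sum(fq[i] * z[i] for i in range(d_k))
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out

# Toy inputs (made up for illustration): 4 tokens, key dim 2, value dim 2.
Q = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.7], [0.9, -0.1]]
K = [[0.2, 0.4], [-0.6, 0.1], [0.3, -0.8], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

soft = softmax_attention_causal(Q, K, V)
quad = linear_attention_quadratic(Q, K, V)
rec = linear_attention_recurrent(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(quad, rec) for a, b in zip(ra, rb))
print("max |quadratic - recurrent| for linear attention:", max_diff)
```

The recurrent version never stores past keys or values, only the accumulated `S` and `z`, which is where the linear-time, constant-memory behaviour comes from. The softmax version has no such fixed-size state in this sketch, and its outputs generally differ from the linear ones, which is the accuracy gap the paper sets out to explain.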
Taxonomy
Topics: Complex Systems and Time Series Analysis · Advanced Thermodynamics and Statistical Mechanics · Neural Networks and Applications
