Fundamental Limitations on Subquadratic Alternatives to Transformers
Josh Alman, Hantao Yu

TL;DR
This paper proves that, assuming a popular conjecture from fine-grained complexity theory, no subquadratic alternative to Transformers can perform certain important document similarity tasks that Transformers can perform, establishing a fundamental computational limitation.
Contribution
It demonstrates that no subquadratic-time algorithm, whether a heuristic for faster attention or an alternative architecture, can match Transformers on document similarity tasks, assuming a popular conjecture from fine-grained complexity theory.
Findings
Transformers can perform document similarity tasks efficiently.
Subquadratic algorithms cannot perform these tasks, assuming the conjecture holds.
No faster alternative to the attention mechanism can replicate Transformer performance on these tasks.
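To make the quadratic cost behind these findings concrete, here is a minimal sketch of brute-force Max-IP, one of the similarity problems named in the reviews. It assumes the standard formulation (given two sets of vectors, find the largest inner product over all cross pairs); the function names and sample data are illustrative and not taken from the paper.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def max_inner_product(A, B):
    """Brute-force Max-IP (assumed standard formulation).

    Examines every pair (a, b) with a in A and b in B, so the running
    time grows as len(A) * len(B): quadratic when both sets have n vectors.
    Returns (best value, (index in A, index in B)).
    """
    best_val, best_pair = None, None
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            val = inner(a, b)
            if best_val is None or val > best_val:
                best_val, best_pair = val, (i, j)
    return best_val, best_pair


# Hypothetical toy "document" vectors, e.g. word counts.
A = [[1, 0, 2], [0, 1, 1]]
B = [[1, 1, 0], [2, 0, 1]]
result = max_inner_product(A, B)
```

The double loop is the point of the sketch: the findings above say that, under the conjecture, no algorithm can avoid this kind of pairwise cost by a polynomial factor.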
Abstract
The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that a Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We…
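The quadratic cost mentioned in the abstract can be seen in a minimal single-head attention sketch. It assumes the standard scaled dot-product softmax form; the abstract itself does not spell out the formula, and the function name and inputs here are illustrative.

```python
import math


def attention(Q, K, V):
    """Naive single-head attention (assumed scaled dot-product softmax form).

    For each of the n query rows, a score is computed against every one of
    the n key rows, so n * n scores are computed in total.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # One score per (query, key) pair: this is the quadratic step.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the value rows.
        row = [
            sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(len(V[0]))
        ]
        out.append(row)
    return out


# With all-zero queries and keys every score is equal, so each output
# row is the plain average of the value rows.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Heuristic speedups and alternatives such as state space models aim to avoid computing all n * n scores; the paper's claim is that doing so necessarily gives up the ability to solve certain tasks.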
Peer Reviews
Decision: ICLR 2025 Poster
Although this is not my area of expertise, I believe the paper's main findings, particularly regarding document similarity tasks, have the potential to impact the literature on architectural development. Additionally, to the best of my knowledge, the paper is well written and the math is accurate and concise.
While the theoretical coverage seems accurate to me, I was disappointed by the lack of empirical evidence supporting the paper's main findings. For instance, experiments demonstrating that subquadratic models cannot solve Max-IP, Min-IP, MSD, and LSD, whereas transformers can, would have been valuable.
The results are theoretical, demonstrating a fundamental limitation of any subquadratic approximations of the transformers' quadratic attention mechanisms. The involved steps seem sound.
I believe the results, while rigorous and sound, have almost no connection with the transformers used in practice. Generally speaking, transformers have been shown empirically to be very capable of solving different kinds of problems. For example, work on subquadratic approximations tries to show that they can perform similarly to the original transformers but more efficiently, which is orthogonal to the results in the paper. Due to the nature of the results, there are no experimental results. But the
Finally, a more mathematically and theoretically founded analysis of this prominent architecture and its limitations. The analysis seems rigorous, and it offers insights that can influence future research on alternative architectures in NLP.
Minor issues / missing discussions: 1.1) The paper would benefit from discussing the practical impact of its findings. In particular, the bounds in its proof seem to be quite practically relevant and not 'entirely asymptotic'; in practice it could be more relevant to have 'bad asymptotics with good bounds'. 1.2) It would be beneficial to explore the performance of alternative architectures, in particular with respect to practicability. I am not sure whether a state space model really cannot handle OVC at a reasonable depth, such as 10 layers.
Taxonomy
Topics: Topic Modeling · Big Data and Digital Economy · Natural Language Processing Techniques
Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Dropout · Byte Pair Encoding · Absolute Position Encodings
