Subquadratic Algorithms and Hardness for Attention with Any Temperature
Shreya Gupta, Boyang Huang, Barna Saha, Yinzhan Xu, Christopher Ye

TL;DR
This paper characterizes when subquadratic algorithms for Attention are possible across different temperature regimes, providing new algorithms for low-dimensional cases and proving optimality results under complexity assumptions.
Contribution
It introduces the first subquadratic Attention algorithm for constant dimension and large entry size, and establishes hardness results for higher dimensions under SETH.
Findings
Subquadratic Attention algorithms exist for constant dimension with polylogarithmic dependence on entry size.
Hardness results show no significant improvement is possible for higher dimensions under SETH.
The standard Attention algorithm is optimal for high-dimensional cases under common complexity assumptions.
Abstract
Despite the popularity of the Transformer architecture, the standard algorithm for computing Attention suffers from quadratic time complexity in context length $n$. Alman and Song [NeurIPS 2023] showed that when the head dimension $d = \Theta(\log n)$, subquadratic Attention is possible if and only if the inputs have small entries bounded by $B = o(\sqrt{\log n})$ in absolute values, under the Strong Exponential Time Hypothesis (SETH). Equivalently, subquadratic Attention is possible if and only if the softmax is applied with high temperature for $d = \Theta(\log n)$. Running times of these algorithms depend exponentially on $B$ and thus they do not lead to even a polynomial-time algorithm outside the specific range of $B$. This naturally leads to the question: when can Attention be computed efficiently without strong assumptions on temperature? Are there fast attention…
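The link the abstract draws between entry size and temperature can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it implements the standard quadratic-time softmax Attention in NumPy and checks that inputs with entries bounded by $B$ give the same output as inputs rescaled to entries bounded by 1 with the softmax temperature lowered to $1/B^2$. The function name `attention`, the `temperature` parameter, and the choice to divide the scores by the temperature with no further normalization are assumptions made for this example; the paper's exact normalization may differ.

```python
import numpy as np

def attention(Q, K, V, temperature=1.0):
    """Standard softmax Attention (illustrative helper, not the paper's code).

    Forms the full n x n score matrix, so it runs in O(n^2 d) time.
    """
    scores = (Q @ K.T) / temperature             # n x n matrix of scaled inner products
    scores -= scores.max(axis=1, keepdims=True)  # stabilize exp; softmax is unchanged
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d, B = 64, 4, 8.0                             # hypothetical sizes and entry bound
Q = rng.uniform(-B, B, size=(n, d))              # entries bounded by B in absolute value
K = rng.uniform(-B, B, size=(n, d))
V = rng.uniform(-1.0, 1.0, size=(n, d))

# Entries bounded by B at temperature 1 ...
out_large_entries = attention(Q, K, V, temperature=1.0)
# ... match entries bounded by 1 at temperature 1 / B^2.
out_low_temp = attention(Q / B, K / B, V, temperature=1.0 / B**2)

assert np.allclose(out_large_entries, out_low_temp)
```

Under this convention a larger entry bound $B$ corresponds to a lower temperature, which is why an algorithm whose running time grows only polylogarithmically in $B$ applies to essentially any temperature.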
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper proposes a subquadratic algorithm for computing Attention at arbitrary temperature, running in $\tilde{O}(n^{2-1/d})$ time. 2. It gives a complete complexity characterization from constant to polynomial head dimension. 3. The technical approach is elegant, combining polynomial approximation with range searching.
1. The presentation and paper structure are not entirely clear, making it hard to identify which parts are original. The technical overview could focus more on the main insights and intuitions, deferring technical details to the appendices. 2. The main concern is the lack of empirical comparisons between the proposed algorithm and existing methods (as baselines), which makes it difficult to evaluate practical performance and efficiency. 3. Following 2, although the theoretical contributions
1. Fundamental theoretical breakthrough. The paper resolves a key open question by providing the first subquadratic attention algorithm that scales polylogarithmically (rather than exponentially) with entry size $B$, enabling efficient computation for arbitrary temperature parameters. This represents a significant advance over Alman & Song (2024a), whose algorithms only worked for $B=o(\sqrt{\log n})$. 2. Novel technical approach with elegant insights. The combination of polynomial approximati
1. Limited practical applicability for realistic parameters. The algorithm only achieves subquadratic time for constant d, while practical Transformers commonly use d = 64, 128, or larger. As stated on lines 136–138, “when $d=\omega(1)$, the above algorithms requires $n^{2-o(1)}$ time,” meaning no improvement over the standard $\mathcal{O}(n^{2}d)$ algorithm for most practical settings. 2. (minor) Limited experimental validation. Without experiments, it is unclear whether the algorithm is pract
The paper has many strengths, in particular an algorithm for approximate attention with polylogarithmic dependence on $B$ when $d=O(1)$, which is a significant improvement over previous work, and an improved lower bound that holds for even smaller values of $d$, i.e. $d=2^{\Omega(\log^{*}n)}$.
One mild weakness is that the result appears to be entirely theoretical and the algorithm is not practical. However, this is not a big negative, as the theoretical contribution is substantial.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax
