Subquadratic Algorithms and Hardness for Attention with Any Temperature

Shreya Gupta; Boyang Huang; Barna Saha; Yinzhan Xu; Christopher Ye

arXiv:2505.14840·cs.LG·May 22, 2025

TL;DR

This paper characterizes when subquadratic algorithms for Attention are possible across different temperature regimes, providing new algorithms for low-dimensional cases and proving optimality results under complexity assumptions.

Contribution

It introduces the first subquadratic Attention algorithm for constant dimension and large entry size, and establishes hardness results for higher dimensions under SETH.

Findings

01

Subquadratic Attention algorithms exist for constant dimension with polylogarithmic dependence on entry size.

02

Hardness results show no significant improvement is possible for higher dimensions under SETH.

03

The standard Attention algorithm is optimal for high-dimensional cases under common complexity assumptions.
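The gap between the constant-dimension and higher-dimension findings can be made concrete with a little arithmetic. The sketch below takes the $\tilde{O}(n^{2-1/d})$ running time quoted in the reviews and computes the factor $n^{1/d}$ it saves over $n^2$, ignoring constants and polylogarithmic factors. The context length `n` and the list of dimensions are illustrative assumptions, not values from the paper.

```python
def savings_factor(n: int, d: int) -> float:
    """Ratio n^2 / n^(2 - 1/d) = n^(1/d); constants and polylog factors are ignored."""
    return n ** (1.0 / d)


# Illustrative context length (an assumption, not taken from the paper).
n = 1_000_000

for d in (1, 2, 4, 64, 128):
    print(f"d = {d:>3}: n^(1/d) = {savings_factor(n, d):,.2f}")
```

For small constant $d$ the factor is large, while for head dimensions such as 64 or 128 it is close to 1, which matches the reviewers' remark that the bound approaches $n^{2-o(1)}$ once $d = \omega(1)$.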

Abstract

Despite the popularity of the Transformer architecture, the standard algorithm for computing Attention suffers from quadratic time complexity in context length $n$. Alman and Song [NeurIPS 2023] showed that when the head dimension $d = \Theta(\log n)$, subquadratic Attention is possible if and only if the inputs have small entries bounded by $B = o(\sqrt{\log n})$ in absolute values, under the Strong Exponential Time Hypothesis (SETH). Equivalently, subquadratic Attention is possible if and only if the softmax is applied with high temperature for $d = \Theta(\log n)$. Running times of these algorithms depend exponentially on $B$ and thus they do not lead to even a polynomial-time algorithm outside the specific range of $B$. This naturally leads to the question: when can Attention be computed efficiently without strong assumptions on temperature? Are there fast attention…
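The link between temperature and entry size in the abstract can be checked numerically. The following is a minimal sketch, assuming the common convention that attention is $\mathrm{softmax}(QK^\top / T)\,V$ with temperature $T$; the paper's exact normalization may differ. It implements the standard quadratic-time algorithm in plain Python and shows that running at temperature $T$ gives the same output as running at temperature 1 on $Q$ and $K$ scaled by $1/\sqrt{T}$, so a low temperature corresponds to a large entry bound $B$. The function names and the small matrices are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, temperature=1.0):
    """Standard attention softmax(Q K^T / temperature) V, using O(n^2 d) operations."""
    out = []
    for q in Q:
        # One score per key: n inner products of length d for each query.
        scores = [sum(a * b for a, b in zip(q, k)) / temperature for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def scale(M, c):
    """Multiply every entry of matrix M by the scalar c."""
    return [[c * x for x in row] for row in M]


# Small illustrative inputs (assumptions, not data from the paper).
Q = [[0.5, -1.0], [1.0, 0.25], [-0.75, 0.5]]
K = [[1.0, 0.0], [-0.5, 1.0], [0.25, -1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
T = 0.25

low_temp = attention(Q, K, V, temperature=T)
s = 1.0 / math.sqrt(T)  # entries grow by this factor when T < 1
rescaled = attention(scale(Q, s), scale(K, s), V, temperature=1.0)

max_diff = max(abs(x - y)
               for r1, r2 in zip(low_temp, rescaled)
               for x, y in zip(r1, r2))
print(max_diff)
```

The printed difference should be at or near floating-point zero, which is why a bound on $B$ and a bound on temperature can be stated interchangeably.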

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. The paper proposes a subquadratic algorithm for computing Attention at arbitrary temperature in $\tilde{O}(n^{2-1/d})$ time.
2. It gives a complete complexity characterization from constant to polynomial head dimension.
3. The technical approach is elegant, combining polynomial approximation with range searching.

Weaknesses

1. The presentation and structure of the paper are unclear, making it hard to identify which parts are original. The technical overview could focus more on the main insights or intuitions, deferring technical details to appendices.
2. The main concern is the lack of empirical comparisons between the proposed algorithm and existing methods (as baselines), which makes it difficult to evaluate practical performance and efficiency.
3. Following 2, although the theoretical contributions

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. Fundamental theoretical breakthrough. The paper resolves a key open question by providing the first subquadratic attention algorithm that scales polylogarithmically (rather than exponentially) with entry size $B$, enabling efficient computation for arbitrary temperature parameters. This represents a significant advance over Alman & Song (2024a), whose algorithms only worked for $B=o(\sqrt{\log n})$.
2. Novel technical approach with elegant insights. The combination of polynomial approximati

Weaknesses

1. Limited practical applicability for realistic parameters. The algorithm only achieves subquadratic time for constant $d$, while practical Transformers commonly use $d = 64$, $128$, or larger. As stated on lines 136–138, “when $d=\omega(1)$, the above algorithms requires $n^{2-o(1)}$ time,” meaning no improvement over the standard $\mathcal{O}(n^{2}d)$ algorithm for most practical settings.
2. (minor) Limited experimental validation. Without experiments, it is unclear whether the algorithm is pract

Reviewer 03 · Rating 8 · Confidence 2

Strengths

The paper has many strengths. In particular, it develops an algorithm for approximate attention with polylogarithmic dependence on $B$ when $d=O(1)$, a significant improvement over previous work, and it extends the lower bound to even smaller values of $d$, i.e. $d=2^{\Omega(\log^{*}n)}$.

Weaknesses

One mild weakness is that the result appears to be entirely theoretical and the algorithm is not practical. However, this is not a big negative, as the theoretical contribution is substantial.
