Poly-attention: a general scheme for higher-order self-attention
Sayak Chakrabarti, Toniann Pitassi, Josh Alman

TL;DR
This paper introduces poly-attention, a flexible framework for higher-order self-attention mechanisms that can model complex token interactions efficiently, balancing expressiveness and computational complexity.
Contribution
It defines a broad class of poly-attention mechanisms, analyzes their computational and representational capabilities, and proposes a new quadratic-time attention method for function composition.
Findings
New quadratic-time attention mechanism for function composition
Matching lower bounds for computing attention matrices
Trade-offs between expressiveness and approximation in poly-attention
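As a rough illustration of why function composition is a two-hop task, the sketch below assumes one common formulation in which each token stores a single entry of a function table. The token encoding and the helper name `compose_from_tokens` are hypothetical and are not taken from the paper.

```python
def compose_from_tokens(tokens, x):
    """Return f(g(x)) from a flat token list.

    tokens: list of (name, arg, value) triples, a hypothetical encoding in
    which each token stores one entry of the lookup table for "f" or "g".
    """
    g = {arg: value for name, arg, value in tokens if name == "g"}
    f = {arg: value for name, arg, value in tokens if name == "f"}
    inner = g[x]      # first hop: the token that stores g(x)
    return f[inner]   # second hop: which f-token is needed depends on the first hop


tokens = [("g", 0, 2), ("g", 1, 0), ("f", 2, 5), ("f", 0, 7)]
print(compose_from_tokens(tokens, 0))
print(compose_from_tokens(tokens, 1))
```

The answer depends jointly on the query, one g-token, and one f-token, which is the kind of three-way dependency that pairwise attention scores do not capture directly.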
Abstract
The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention and Strassen attention, which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the…
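The contrast between pairwise and higher-order attention can be sketched in plain Python. This is a minimal brute-force illustration and not the paper's construction: the third-order variant assumes the trilinear score used in earlier tensor-attention proposals, the function names are hypothetical, and the usual scaling factor is omitted for brevity.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a flat list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_attention(Q, K, V):
    """Standard self-attention: query i scores each single token j (n^2 scores)."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out


def triple_attention(Q, K1, K2, V1, V2):
    """Assumed third-order attention: query i scores each pair (j, k) (n^3 scores)."""
    n, d = len(K1), len(V1[0])
    pairs = [(j, k) for j in range(n) for k in range(n)]
    out = []
    for q in Q:
        # Trilinear score: sum over coordinates of q[c] * K1[j][c] * K2[k][c].
        scores = [sum(q[c] * K1[j][c] * K2[k][c] for c in range(len(q)))
                  for j, k in pairs]
        w = softmax(scores)
        # The value for a pair is the elementwise product of the two value rows.
        out.append([sum(w[p] * V1[j][c] * V2[k][c]
                        for p, (j, k) in enumerate(pairs))
                    for c in range(d)])
    return out
```

The brute-force triple version enumerates every pair of tokens per query, which is where the superquadratic cost mentioned in the abstract comes from.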
Peer Reviews
Decision: ICLR 2026 Poster
Originality: The paper introduces a unified framework (poly-attention) that generalizes several recent higher-order attention mechanisms. The introduction of tree-attention and its theoretical and empirical validation is novel and meaningful. Quality: The technical contributions are substantial. The authors provide rigorous complexity analyses, including both upper and lower bounds, and support their claims with proofs and experiments. The connection to fine-grained complexity (e.g., SETH, Max-
Experimental Scope: While the paper includes an experimental validation of tree-attention, the evaluation is limited to a single task (function composition). More diverse benchmarks (e.g., on standard NLP or reasoning tasks) would strengthen the practical relevance of the proposed mechanisms. Presentation of Lower Bounds: The lower-bound proofs, especially those based on fine-grained complexity, are highly technical and may be difficult to follow for a general audience. A more intuitive explana
The paper proposes a "unified poly-attention" framework that organizes several higher-order attention mechanisms. That seems useful, especially for follow-up theory work. I like that the representational-ability comparisons do not ignore computational complexity. I didn't check every proof in detail, but the upper/lower bounds story is believable, and it explains why other attention variants blow up to superquadratic time. Overall, the exposition is quite clear, mapping the special cases ea
W1) About the motivation (I feel this is crucial): Although poly-attention can possibly compute other functions discussed in the paper, its main motivation/achievement is framed around the function composition problem (even your "punchline" is around it). However, you should explain more clearly why this particular functional form matters. What makes it a good choice? In principle, we can always cook up an unusual functional form and then design a neural network that works well for it. Because of t
1. **Generalization of attention mechanism:** The poly-attention framework itself is an elegant generalization that unifies standard self-attention, tensor attention, and Strassen attention under a single, intuitive parameter: the attention polynomial $h$. 2. **Rigorous theoretical analysis:** The paper's main strength is its rigorous analysis. The authors use tools from computational complexity theory to provide a tight characterization of the running time. The finding that tree polynomials ar
1. **Clarification on core definitions:** The paper's central contribution, the poly-attention mechanism, is presented in Definition 2.2. This definition is inconsistent, making the paper's subsequent claims impossible to verify. Please correct me if I am wrong. The attention polynomial $h$ is defined over **$t$** variables ($x_1, ..., x_t$). However, the exponent in Definition 2.2 is written as $h(Q_{l_{1}}^{(1)},...,Q_{l_{k}}^{(k)})$, incorrectly using the polynomial's *degree* **$k$** as the
Taxonomy
Topics: Quantum many-body systems · Ferroelectric and Negative Capacitance Devices · Topic Modeling
