On the Computational Hardness of Transformers
Barna Saha, Yinzhan Xu, Christopher Ye, Hantao Yu

TL;DR
This paper proves that, under standard complexity assumptions, computing multiple attention heads in a transformer cannot be done significantly faster than computing each head separately, establishing computational lower bounds for both the small and large embedding regimes.
Contribution
It provides the first non-trivial lower bounds for multi-head multi-layer transformers, showing optimality of current algorithms under standard complexity assumptions.
Findings
In the small embedding regime, computing all attention heads separately in $LHN^{2+o(1)}$ time is near-optimal.
In the large embedding regime, computing all attention heads requires $LHN^{\omega+o(1)}$ operations, where $\omega$ is the matrix multiplication exponent, matching known upper bounds.
The paper introduces a novel application of the Baur-Strassen theorem to establish lower bounds in the large embedding regime.
Abstract
The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of "direct sum" problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication…
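The baseline that the abstract describes, where every head is computed independently, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the per-head projection matrices `Wq`, `Wk`, `Wv`, and the choice to combine heads by summation are all assumptions made for the example, and the usual $1/\sqrt{m}$ score scaling is omitted. Plain Python lists are used so the sketch has no dependencies.

```python
import math

def matmul(A, B):
    """Naive product of an (n x k) and a (k x p) matrix stored as nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax_rows(S):
    """Row-wise softmax, shifted by each row's maximum for numerical stability."""
    out = []
    for row in S:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(Q, K, V):
    """One head: softmax(Q K^T) V for N x m inputs Q, K, V."""
    Kt = [list(col) for col in zip(*K)]
    scores = matmul(Q, Kt)                  # N x N intermediate product
    return matmul(softmax_rows(scores), V)  # N x m output

def layer(X, heads):
    """Run each head independently on X; summing the outputs is an assumed combination rule."""
    N, m = len(X), len(X[0])
    out = [[0.0] * m for _ in range(N)]
    for Wq, Wk, Wv in heads:                # hypothetical per-head m x m projections
        Y = attention_head(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
        for i in range(N):
            for j in range(m):
                out[i][j] += Y[i][j]
    return out

def transformer(X, layers):
    """Feed the combined output of each layer to the next: L * H head evaluations in total."""
    for heads in layers:
        X = layer(X, heads)
    return X
```

Each call to `attention_head` builds an $N \times N$ score matrix, so with $L$ layers of $H$ heads this baseline performs $LH$ such computations. The paper's question is whether any algorithm can do substantially better than this head-by-head approach.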
Taxonomy
Topics: Complexity and Algorithms in Graphs · Stochastic Gradient Optimization Techniques · Quantum Computing Algorithms and Architecture
