On the Computational Hardness of Transformers

Barna Saha; Yinzhan Xu; Christopher Ye; Hantao Yu

arXiv:2603.11332·cs.CC·March 13, 2026

TL;DR

This paper proves that, under standard complexity assumptions, computing multiple attention heads in a transformer cannot be done significantly faster than computing each head independently, and it establishes matching lower bounds for both the small and large embedding regimes.

Contribution

The paper provides the first non-trivial lower bounds for multi-head, multi-layer transformers, showing that current algorithms are optimal under standard complexity assumptions.

Findings

01

In the small embedding regime, computing all attention heads takes $LHN^{2+o(1)}$ time, which is near-optimal.

02

In the large embedding regime, computing all attention heads requires $LHN^{\omega+o(1)}$ operations, where $\omega$ is the matrix multiplication exponent, matching known upper bounds.

03

The paper introduces a novel application of the Baur-Strassen theorem to establish lower bounds in the large embedding regime.

Abstract

The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of "direct sum" problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication…
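As a rough illustration of the baseline the abstract describes, the following sketch computes a single attention head with schoolbook arithmetic and then evaluates several heads one at a time. It assumes the standard formulation $\mathrm{softmax}(QK^\top)V$ with a row-wise softmax and no scaling factor, which the abstract does not spell out. The function names `attention_head` and `all_heads` are hypothetical and not taken from the paper.

```python
import math


def softmax(row):
    # Subtract the row maximum before exponentiating, for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    # q, k, v are N x m matrices given as lists of rows.
    # Assumption: attention is softmax(q k^T) v with a row-wise softmax.
    # Intermediate product q k^T: an N x N matrix of dot products.
    scores = [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    # Final product (N x N) times (N x m), giving an N x m output.
    return [
        [sum(w * vj[c] for w, vj in zip(w_row, v)) for c in range(len(v[0]))]
        for w_row in weights
    ]


def all_heads(heads):
    # Baseline strategy: evaluate each (q, k, v) triple independently.
    return [attention_head(q, k, v) for q, k, v in heads]
```

With schoolbook multiplication, each head in this sketch costs about $2N^2m$ scalar multiplications, so $H$ heads over $L$ layers cost about $2LHN^2m$. The direct sum question in the abstract asks whether sharing work across heads can beat this kind of head-by-head evaluation.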

Taxonomy

Topics: Complexity and Algorithms in Graphs · Stochastic Gradient Optimization Techniques · Quantum Computing Algorithms and Architecture