On The Computational Complexity of Self-Attention
Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, Chinmay Hegde

TL;DR
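This paper proves that, unless the Strong Exponential Time Hypothesis (SETH) is false, the quadratic time complexity of self-attention in transformers is unavoidable, even when attention is computed only approximately. It also shows that a Taylor series approximation can run in linear time, at a cost that grows exponentially with the polynomial order.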
Contribution
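It establishes rigorous, SETH-based lower bounds on the time complexity of self-attention, and shows that a Taylor series approximation achieves running time linear in the input length at the price of an exponential dependence on the polynomial order.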
Findings
Quadratic complexity of self-attention is necessary unless SETH is false.
Self-attention can be approximated in time linear in the input length using a truncated Taylor series.
This approximation incurs a cost that depends exponentially on the polynomial order.
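The trade-off in the last two findings can be illustrated with a minimal sketch. It is not the authors' construction: it assumes unscaled dot-product attention, replaces exp(q·k) with its Taylor polynomial of a chosen order, and uses pure-Python lists for clarity. The names `taylor_features`, `taylor_attention`, and `max_error` are hypothetical. An even order is assumed so that the polynomial stays positive and the normalization is well defined.

```python
import math
import random
from itertools import product

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_attention(Q, K, V):
    """Softmax attention (no 1/sqrt(d) scaling); evaluates all n*n scores."""
    out = []
    for q in Q:
        w = [math.exp(dot(q, k)) for k in K]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def taylor_features(x, order):
    """Monomial features with dot(phi(q), phi(k)) = sum_{p<=order} (q.k)^p / p!.
    The output length is sum_p d^p, i.e. exponential in `order`."""
    feats = []
    for p in range(order + 1):
        scale = 1.0 / math.sqrt(math.factorial(p))
        for idx in product(range(len(x)), repeat=p):
            feats.append(scale * math.prod(x[i] for i in idx))
    return feats

def taylor_attention(Q, K, V, order):
    """Approximate attention with one pass over keys and one over queries."""
    d_v = len(V[0])
    phi_K = [taylor_features(k, order) for k in K]
    m = len(phi_K[0])
    S = [[0.0] * d_v for _ in range(m)]   # S[f][c] = sum_j phi(k_j)[f] * v_j[c]
    z = [0.0] * m                         # z[f]    = sum_j phi(k_j)[f]
    for fk, v in zip(phi_K, V):
        for f in range(m):
            z[f] += fk[f]
            for c in range(d_v):
                S[f][c] += fk[f] * v[c]
    out = []
    for q in Q:
        fq = taylor_features(q, order)
        denom = dot(fq, z)
        out.append([sum(fq[f] * S[f][c] for f in range(m)) / denom
                    for c in range(d_v)])
    return out

# Small random inputs (assumed for illustration); entries kept small so a
# low-order polynomial is already a reasonable approximation of exp.
random.seed(0)
n, d = 6, 2
rand_vec = lambda: [random.uniform(-0.5, 0.5) for _ in range(d)]
Q = [rand_vec() for _ in range(n)]
K = [rand_vec() for _ in range(n)]
V = [rand_vec() for _ in range(n)]

def max_error(order):
    exact = exact_attention(Q, K, V)
    approx = taylor_attention(Q, K, V, order)
    return max(abs(a - b) for ra, rb in zip(exact, approx)
               for a, b in zip(ra, rb))

for order in (2, 4, 6):
    print(order, len(taylor_features(Q[0], order)), max_error(order))
```

In this sketch the work per query and per key depends only on the feature length, not on n, so the total cost is linear in the input length. The feature length itself is 1 + d + d^2 + ... + d^order, which is where the exponential dependence on the polynomial order shows up.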
Abstract
Transformer architectures have led to remarkable progress in many state-of-the-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our…
Taxonomy
Topics: Neural Networks and Reservoir Computing · Semiconductor materials and devices · Advanced Memory and Neural Computing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
