On The Computational Complexity of Self-Attention

Feyza Duman Keles; Pruthuvi Mahesakya Wijewardena; Chinmay Hegde

arXiv:2209.04881 · cs.LG · September 13, 2022 · 29 citations

Open Access

TL;DR

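This paper proves that, unless the Strong Exponential Time Hypothesis (SETH) is false, self-attention in transformers requires time quadratic in the input length, even when it is computed only approximately. It also shows that a Taylor series approximation runs in linear time, at the cost of an exponential dependence on the polynomial order.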

Contribution

It establishes rigorous, SETH-conditional lower bounds on the time complexity of self-attention and shows that a linear-time Taylor series approximation is possible, with an exponential dependence on the polynomial order.

Findings

01. Self-attention requires time quadratic in the input length unless the Strong Exponential Time Hypothesis (SETH) is false.

02. Dot-product self-attention can be approximated in linear time using a finite Taylor series.

03. The linear-time approximation incurs an exponential dependence on the polynomial order.
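
The second and third findings can be illustrated with a minimal sketch, under stated assumptions, of one way to replace exp(q·k) with its truncated Taylor series so that attention runs in a single pass over the keys. The names `taylor_features` and `taylor_attention`, the unscaled scores, and the list-of-lists representation are illustrative choices and not necessarily the paper's exact construction.

```python
import math
from itertools import product


def taylor_features(x, order):
    """Features phi(x) with phi(q).phi(k) = sum over t <= order of (q.k)**t / t!."""
    feats = []
    for t in range(order + 1):
        coeff = 1.0 / math.sqrt(math.factorial(t))
        # Enumerate all d**t entries of the t-fold tensor power of x.
        for idx in product(range(len(x)), repeat=t):
            feats.append(coeff * math.prod(x[i] for i in idx))
    return feats


def taylor_attention(Q, K, V, order=2):
    """Approximate attention; cost is linear in the number of tokens.

    Assumption: `order` is even, which keeps the truncated series of exp,
    and therefore the normaliser, strictly positive.
    """
    d_v = len(V[0])
    phi_K = [taylor_features(k, order) for k in K]
    m = len(phi_K[0])  # feature dimension: 1 + d + d**2 + ... + d**order

    # One pass over the keys:
    #   S[f][c] = sum_j phi(k_j)[f] * v_j[c],   z[f] = sum_j phi(k_j)[f]
    S = [[0.0] * d_v for _ in range(m)]
    z = [0.0] * m
    for fk, v in zip(phi_K, V):
        for f in range(m):
            z[f] += fk[f]
            for c in range(d_v):
                S[f][c] += fk[f] * v[c]

    # One pass over the queries, reusing S and z; no n-by-n score matrix is formed.
    outputs = []
    for q in Q:
        fq = taylor_features(q, order)
        denom = sum(a * b for a, b in zip(fq, z))
        outputs.append(
            [sum(fq[f] * S[f][c] for f in range(m)) / denom for c in range(d_v)]
        )
    return outputs
```

In this sketch the work per token is proportional to the feature dimension, which grows like d**order. The running time is therefore linear in the sequence length but exponential in the polynomial order, which is the trade-off the findings describe.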

Abstract

Transformer architectures have led to remarkable progress in many state-of-the-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time and space complexity are quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our…
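
To make the quadratic cost mentioned in the abstract concrete, the following is a minimal pure-Python sketch of exact dot-product self-attention that also counts the query-key dot products it evaluates. The function name `softmax_attention`, the 1/sqrt(d) score scaling, and the list-of-lists representation are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax_attention(Q, K, V):
    """Exact softmax attention; returns (outputs, number of query-key dot products)."""
    d = len(Q[0])
    d_v = len(V[0])
    scale = 1.0 / math.sqrt(d)  # assumed scaling, as in standard transformers
    outputs = []
    dot_products = 0
    for q in Q:
        # One score per (query, key) pair: n scores per query, n * n in total.
        scores = []
        for k in K:
            scores.append(scale * sum(a * b for a, b in zip(q, k)))
            dot_products += 1
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Output row is the softmax-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) / total for c in range(d_v)]
        )
    return outputs, dot_products
```

For n queries and n keys the counter returned by this sketch equals n * n, so doubling the sequence length quadruples the number of score computations.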

Taxonomy

Topics: Neural Networks and Reservoir Computing · Semiconductor materials and devices · Advanced Memory and Neural Computing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings