Compression Barriers for Autoregressive Transformers
Themistoklis Haris, Krzysztof Onak

TL;DR
This paper establishes fundamental space complexity limits for autoregressive Transformer attention mechanisms, proving that sublinear space algorithms are impossible without specific assumptions, and explores compression techniques and time complexity bounds.
Contribution
It provides theoretical lower bounds on memory usage for attention algorithms, introduces a new compression algorithm for sliding window attention, and analyzes the time complexity of token generation.
Findings
Any attention-based token generation algorithm requires $\Omega(nd)$ space when $d = \Omega(\log n)$
For low-dimensional embeddings, space requirements grow exponentially with dimension: $\Omega(d e^d)$
No non-adaptive algorithm can compute attention in sublinear time for all tokens
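For reference, the quantity these bounds concern can be written out explicitly. The block below is a sketch of standard softmax attention for the $n$-th generated token; the symbols $q_n$, $k_i$, $v_i$ and the omission of a $1/\sqrt{d}$ scaling factor are assumptions made here for illustration, not notation taken from the paper.

```latex
% Softmax attention output for the n-th token (notation assumed for illustration).
\[
\mathrm{Attn}(q_n, K_n, V_n)
  \;=\; \sum_{i=1}^{n}
        \frac{\exp\!\left(\langle q_n, k_i \rangle\right)}
             {\sum_{j=1}^{n} \exp\!\left(\langle q_n, k_j \rangle\right)}\, v_i ,
\qquad q_n,\, k_i,\, v_i \in \mathbb{R}^{d}.
\]
```

An algorithm that stores every $k_i$ and $v_i$ keeps $2nd$ numbers, which is the uncompressed baseline the space lower bound is measured against.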
Abstract
A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Omega(nd)$ space, where $n$ is the number of tokens generated so far and $d = \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that…
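To make the memory cost concrete, here is a minimal pure-Python sketch of token-by-token attention with an uncompressed KV cache. The names `KVCache`, `attend`, and `generate` are hypothetical and introduced only for illustration; this is the naive baseline that stores everything, not an algorithm from the paper.

```python
import math


class KVCache:
    """Uncompressed cache: keeps every key and value vector seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(list(k))
        self.values.append(list(v))

    def num_floats(self):
        # Total stored numbers: n keys plus n values of dimension d, i.e. 2 * n * d.
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


def attend(q, cache):
    """Softmax attention of query q over all cached keys/values (no 1/sqrt(d) scaling)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in cache.keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    d = len(cache.values[0])
    out = [0.0] * d
    for w, v in zip(weights, cache.values):
        for j in range(d):
            out[j] += (w / z) * v[j]
    return out


def generate(stream):
    """Process (q, k, v) triples one at a time, as in autoregressive decoding."""
    cache = KVCache()
    outputs = []
    for q, k, v in stream:
        cache.append(k, v)  # the cache grows by 2 * d numbers per token
        outputs.append(attend(q, cache))
    return outputs, cache


if __name__ == "__main__":
    stream = [
        ([0.0, 0.0], [1.0, 0.0], [2.0, 4.0]),
        ([0.0, 0.0], [0.0, 1.0], [4.0, 8.0]),
    ]
    outs, cache = generate(stream)
    print(outs, cache.num_floats())
```

After n tokens of dimension d the cache holds 2·n·d numbers, so memory grows linearly in the sequence length. KV cache compression schemes try to shrink this footprint; the abstract's claim is that, without structural assumptions on the embeddings, space sublinear in n cannot be achieved.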
Taxonomy
Topics: Complexity and Algorithms in Graphs · Stochastic Gradient Optimization Techniques · Cryptography and Data Security
Methods: Softmax · Attention Is All You Need
