Compression Barriers for Autoregressive Transformers

Themistoklis Haris; Krzysztof Onak

arXiv:2502.15955·cs.DS·February 25, 2025

TL;DR

This paper establishes fundamental space complexity limits for attention in autoregressive Transformers: it proves that sublinear-space algorithms are impossible without structural assumptions on the KV embeddings, and it also explores compression techniques and time complexity bounds.

Contribution

It provides theoretical lower bounds on memory usage for attention algorithms, introduces a new compression algorithm for sliding window attention, and analyzes the time complexity of token generation.

Findings

01

Any attention-based token generation algorithm requires $Θ(nd)$ space when $d = Ω(\log n)$

02

For low-dimensional embeddings, $d = o(\log n)$, space requirements grow exponentially with dimension: $Ω(d e^d)$

03

No non-adaptive algorithm can compute attention in sublinear time for all tokens

Abstract

A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $Θ(nd)$ space, where $n$ is the number of tokens generated so far and $d = Ω(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that…
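To make the memory cost in the abstract concrete, here is a minimal sketch of an uncompressed KV cache, written for illustration and not taken from the paper. The class name `KVCache` and its methods are hypothetical, and the attention uses unscaled softmax over dot products, which is an assumption about the exact normalization. The point is only that the naive cache stores $2nd$ numbers after $n$ tokens, which matches the $Θ(nd)$ bound up to constants.

```python
import math


class KVCache:
    """Hypothetical uncompressed KV cache: keeps every past key and value."""

    def __init__(self, d):
        self.d = d          # embedding dimension
        self.keys = []      # one length-d list per generated token
        self.values = []    # one length-d list per generated token

    def append(self, k, v):
        """Store the key/value embedding of the newest token."""
        assert len(k) == self.d and len(v) == self.d
        self.keys.append(list(k))
        self.values.append(list(v))

    def stored_floats(self):
        """Numbers held in memory: 2 * n * d for n cached tokens."""
        return 2 * len(self.keys) * self.d

    def attend(self, q):
        """Softmax attention of query q over all cached keys (unscaled; an assumption)."""
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in self.keys]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out = [0.0] * self.d
        for w, v in zip(weights, self.values):
            for j in range(self.d):
                out[j] += (w / total) * v[j]
        return out


cache = KVCache(d=2)
cache.append([1.0, 0.0], [1.0, 0.0])
cache.append([0.0, 1.0], [0.0, 1.0])
print(cache.stored_floats())      # memory grows linearly with the number of tokens
print(cache.attend([0.0, 0.0]))   # a zero query weights both cached values equally
```

Each call to `attend` also reads every cached entry, so in this naive form both the space and the per-token time grow with $n$.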

Taxonomy

Topics: Complexity and Algorithms in Graphs · Stochastic Gradient Optimization Techniques · Cryptography and Data Security

Methods: Softmax · Attention Is All You Need