How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
Xiao Wang

TL;DR
This paper investigates the theoretical limits of KV cache compression in Transformer models for multi-step reasoning, providing bounds on depth, bandwidth, and error scaling.
Contribution
It introduces new bounds on Transformer depth related to cache size and compression, and analyzes the impact of cache adaptivity on reasoning accuracy.
Findings
Conjectures a product-form lower bound on Transformer depth with compressed KV caches and proves an unconditional matching upper bound.
Identifies bandwidth limitations when attention dimension times precision exceeds log n.
Shows adaptive caches outperform oblivious caches in error scaling for multi-hop reasoning.
Abstract
The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, a given attention dimension, number of heads, and bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture a product-form lower bound on the depth $L$ of any such Transformer, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log…
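For readers unfamiliar with the benchmark task, the following is a minimal sketch of k-hop pointer chasing under its usual definition: each of n positions stores a pointer to another position, and the answer is the position reached after following k pointers from a start index. The function names `make_pointers` and `chase` are hypothetical and chosen for illustration; the paper's exact input encoding and distribution may differ.

```python
import random


def make_pointers(n, seed=0):
    """Build a random pointer array: position i points to pointers[i] in [0, n).

    Assumption: pointers are drawn uniformly at random; the paper may use
    a different distribution.
    """
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def chase(pointers, start, k):
    """Follow k pointers starting from `start` and return the final position."""
    pos = start
    for _ in range(k):
        pos = pointers[pos]
    return pos


# Small hand-checkable instance: 0 -> 2 -> 3 -> 1 -> 0
example = [2, 0, 3, 1]
three_hops = chase(example, start=0, k=3)  # visits 2, then 3, then 1
```

Each hop depends on the result of the previous one, which is why the task is a natural probe of how much sequential depth a model needs when only part of the pointer array can be held in the cache.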
