Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Gregory Magarshak

TL;DR
This paper introduces a sequential KV cache compression method built on probabilistic language tries and predictive delta coding. Because it compresses the cache as a sequence rather than vector by vector, its theoretical bound lies well below the per-vector Shannon limit that earlier quantizers approach.
Contribution
It presents a novel two-layer architecture exploiting language structure for efficient KV cache compression, extending beyond per-vector entropy bounds.
Findings
The per-token entropy bound is 3.3-4.3 bits at typical language-model perplexity.
Theoretical compression ratio over TurboQuant is approximately 914,000x.
Compression ratio remains high even at 1000x above the entropy floor.
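The figures above can be sanity-checked with a few lines of arithmetic. The sketch below relies on the standard identity that per-token entropy in bits equals log2 of perplexity. The perplexity range of 10 to 20 is an assumption inferred from the quoted 3.3-4.3 bit range, not a value stated in this summary. The helper `ratio_with_overhead` is hypothetical and assumes the compression ratio shrinks linearly with the overhead above the entropy floor.

```python
import math


def bits_per_token(perplexity: float) -> float:
    """Per-token entropy in bits implied by a given perplexity."""
    return math.log2(perplexity)


def ratio_with_overhead(ideal_ratio: float, overhead: float) -> float:
    """Hypothetical helper: assumes the ratio scales inversely with overhead."""
    return ideal_ratio / overhead


# Assumed perplexity range (not stated in the summary).
for ppl in (10, 20):
    print(ppl, round(bits_per_token(ppl), 2))  # 10 3.32, then 20 4.32

# Theoretical ratio quoted above, degraded by a 1000x overhead.
print(ratio_with_overhead(914_000, 1000))  # 914.0
```

Under these assumptions, a coder operating 1000x above the entropy floor would still retain a ratio of roughly 914x over TurboQuant.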
Abstract
Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the…
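As a rough illustration of the trie metric, the sketch below reads s ^ s' as the longest common prefix of the two token sequences and computes -log_2 of the model probability of that prefix by summing per-token surprisals. That reading, the `cond_prob` interface, and the uniform toy model are all assumptions made for illustration; the paper's actual definition and implementation may differ.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns P_M(token | prefix) for some model M.
CondProb = Callable[[Sequence[str], str], float]


def common_prefix(s: Sequence[str], t: Sequence[str]) -> List[str]:
    """Longest shared prefix of two token sequences (assumed meaning of s ^ s')."""
    out: List[str] = []
    for a, b in zip(s, t):
        if a != b:
            break
        out.append(a)
    return out


def trie_metric(s: Sequence[str], t: Sequence[str], cond_prob: CondProb) -> float:
    """-log2 P_M(common prefix), accumulated as a sum of per-token surprisals."""
    prefix = common_prefix(s, t)
    bits = 0.0
    for i, tok in enumerate(prefix):
        bits += -math.log2(cond_prob(prefix[:i], tok))
    return bits


def uniform_model(prefix: Sequence[str], token: str) -> float:
    """Toy stand-in for a language model: uniform over a 4-token vocabulary."""
    return 0.25


s = ["the", "cat", "sat", "down"]
t = ["the", "cat", "sat", "up"]
print(trie_metric(s, t, uniform_model))  # 6.0 (three shared tokens at 2 bits each)
```

With a real model in place of `uniform_model`, `cond_prob` would come from the next-token distribution the model already produces during decoding.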
