PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
Ishan Patel, Ishan Joshi

TL;DR
PolyKV lets multiple concurrent LLM inference agents share a single, asymmetrically compressed key-value (KV) cache pool instead of allocating one cache per agent, reducing memory use with minimal perplexity impact.
Contribution
PolyKV is presented as the first system to combine a shared, lossy-compressed KV cache with multi-reader concurrent access for LLM inference.
Findings
Achieves a stable 2.91x compression ratio across all tested configurations (two model scales, three context lengths, up to 15 concurrent agents).
Reduces KV cache memory from 19.8 GB to 0.45 GB with minimal perplexity impact.
Perplexity degradation is only +0.57% at 15 agents sharing a 4K-token context.
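The 2.91x figure matches a simple per-element bit count. The sketch below assumes a 16-bit baseline for both keys and values and ignores per-block scale metadata; neither assumption is stated in this summary. It also checks the overall reduction implied by the two memory figures. One possible reading, again an assumption, is that this reduction is roughly the compression ratio multiplied by the 15 agents that would otherwise each hold their own cache. All variable names are illustrative.

```python
# Assumption: uncompressed keys and values are 16 bits each (e.g. fp16).
# Assumption: scale/metadata overhead of the quantized formats is ignored.
BASELINE_BITS = 16 + 16      # one key element + one value element
COMPRESSED_BITS = 8 + 3      # int8 keys + 3-bit values

compression_ratio = BASELINE_BITS / COMPRESSED_BITS
print(f"per-element compression: {compression_ratio:.2f}x")

# Overall reduction implied by the reported memory figures.
overall_reduction = 19.8 / 0.45
print(f"reported overall reduction: {overall_reduction:.1f}x")

# Assumption: 15 per-agent caches are replaced by one shared compressed cache.
shared_estimate = 15 * compression_ratio
print(f"15 agents x compression ratio: {shared_estimate:.1f}x")
```

The two estimates agree to within about 1%, which is consistent with this reading but does not confirm it.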
Abstract
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context,…
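The asymmetric scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the block size of 32 values per int8 scale (borrowed from ggml's q8_0 layout), and the per-vector RMS normalization applied before the 3-bit quantizer are all assumptions made for illustration. The centroids are the approximate textbook 8-level Lloyd-Max levels for a unit Gaussian. Injection into HuggingFace DynamicCache objects is not covered.

```python
import numpy as np

# Approximate 8-level Lloyd-Max reconstruction levels for N(0, 1).
CENTROIDS = np.array(
    [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520],
    dtype=np.float32,
)

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two. With the 1/sqrt(n) scaling
    the transform is its own inverse.
    """
    x = np.asarray(x, dtype=np.float32)
    shape, n = x.shape, x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        y = x.reshape(*shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2).reshape(shape)
        h *= 2
    return x / np.sqrt(n)

def quantize_keys_q8(k, block=32):
    """q8_0-style int8 quantization: one scale per block (block size assumed)."""
    blocks = k.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.round(blocks / scale).astype(np.int8), scale

def compress_values(v):
    """FWHT rotation, then nearest-centroid 3-bit quantization."""
    r = fwht(v)
    # Assumption: normalize each vector to unit RMS so N(0, 1) centroids fit.
    rms = np.sqrt(np.mean(r ** 2, axis=-1, keepdims=True)) + 1e-8
    idx = np.abs((r / rms)[..., None] - CENTROIDS).argmin(axis=-1)
    return idx.astype(np.uint8), rms

def decompress_values(idx, rms):
    """Look up centroids, rescale, and undo the rotation."""
    return fwht(CENTROIDS[idx] * rms)

# Toy data standing in for one layer's keys and values (4 tokens, head dim 64).
rng = np.random.default_rng(0)
k = rng.standard_normal((4, 64)).astype(np.float32)
v = rng.standard_normal((4, 64)).astype(np.float32)

q, s = quantize_keys_q8(k)
k_hat = (q * s).reshape(k.shape)
idx, rms = compress_values(v)
v_hat = decompress_values(idx, rms)

key_err = np.linalg.norm(k - k_hat) / np.linalg.norm(k)
val_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative key error: {key_err:.4f}, relative value error: {val_err:.4f}")
```

The keys come back with a much smaller reconstruction error than the values, which reflects the asymmetry the abstract motivates: keys feed the softmax and are kept at higher precision, while values tolerate coarser quantization.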
