PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Ishan Patel; Ishan Joshi

arXiv:2604.24971·cs.LG·April 29, 2026

TL;DR

PolyKV lets multiple concurrent LLM inference agents share one asymmetrically compressed key-value (KV) cache, reaching a 2.91x compression ratio with a reported perplexity increase of +0.57% at 15 agents.

Contribution

PolyKV is presented as the first system to combine a shared, lossily compressed KV cache with concurrent multi-reader access for LLM inference.

Findings

01. Achieves a stable 2.91x compression ratio across all tested configurations.

02. Reduces KV cache memory from 19.8 GB to 0.45 GB with minimal perplexity impact.

03. Increases perplexity by only +0.57% with 15 agents sharing a 4K-token context.
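
The 2.91x and 0.45 GB figures are consistent with simple bit-count arithmetic. The sketch below is a back-of-envelope check, not the authors' accounting. It assumes a 16-bit baseline for uncompressed KV entries, ignores any per-block scale metadata, and assumes the 19.8 GB figure is the total for 15 separate per-agent caches. None of these assumptions is confirmed by the summary, and the function names are illustrative.

```python
# Back-of-envelope check of the reported figures.
# Assumptions (not stated in the summary): 16-bit baseline precision,
# no scale/metadata overhead, and 19.8 GB = total of 15 per-agent caches.

BASELINE_BITS = 16  # assumed fp16/bf16 KV entries
KEY_BITS = 8        # int8 keys (q8_0)
VALUE_BITS = 3      # 3-bit Lloyd-Max values


def compression_ratio(baseline_bits, key_bits, value_bits):
    """Uncompressed bits over compressed bits, with keys and values equal in count."""
    return (2 * baseline_bits) / (key_bits + value_bits)


def shared_pool_gb(per_agent_total_gb, n_agents, ratio):
    """Size of one compressed cache that replaces n_agents separate caches."""
    return per_agent_total_gb / n_agents / ratio


ratio = compression_ratio(BASELINE_BITS, KEY_BITS, VALUE_BITS)
pool = shared_pool_gb(19.8, 15, ratio)
print(f"compression ratio: {ratio:.2f}x")   # prints 2.91x
print(f"shared pool size:  {pool:.2f} GB")  # prints 0.45 GB
```

Under these assumptions the overall 44x memory reduction splits into a 15x saving from sharing and a 2.91x saving from compression.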

Abstract

We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context,…
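
The asymmetric scheme described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PolyKV implementation: the function names are hypothetical, keys use simple absmax int8 scaling over a single block in the style of q8_0, values are normalized by their per-vector RMS before quantization, and no random sign flips are applied before the FWHT. The centroid table holds the standard 8-level Lloyd-Max reconstruction levels for N(0,1), rounded to four decimals.

```python
import math
import random

# Standard 8-level (3-bit) Lloyd-Max reconstruction levels for N(0,1).
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451,
             0.2451, 0.7560, 1.3439, 2.1520]


def quantize_key_int8(block):
    """Absmax int8 quantization of one block: returns (codes, scale)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    return [round(x / scale) for x in block], scale


def dequantize_key_int8(codes, scale):
    return [c * scale for c in codes]


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    len(vec) must be a power of two.
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def quantize_value_3bit(vec):
    """Rotate with the FWHT, normalize by RMS (assumption), snap to centroids."""
    rotated = fwht(vec)
    rms = math.sqrt(sum(x * x for x in rotated) / len(rotated)) or 1.0
    codes = [min(range(8), key=lambda k: abs(x / rms - CENTROIDS[k]))
             for x in rotated]
    return codes, rms


def dequantize_value_3bit(codes, rms):
    """Look up centroids, undo the scaling, then undo the rotation."""
    return fwht([CENTROIDS[c] * rms for c in codes])


def rel_mse(orig, rec):
    """Squared reconstruction error relative to the signal energy."""
    err = sum((a - b) ** 2 for a, b in zip(orig, rec))
    return err / sum(a * a for a in orig)


random.seed(0)
key = [random.gauss(0, 1) for _ in range(64)]    # toy key vector
value = [random.gauss(0, 1) for _ in range(64)]  # toy value vector

k_codes, k_scale = quantize_key_int8(key)
v_codes, v_rms = quantize_value_3bit(value)

key_err = rel_mse(key, dequantize_key_int8(k_codes, k_scale))
value_err = rel_mse(value, dequantize_value_3bit(v_codes, v_rms))
print(f"key relative MSE:   {key_err:.6f}")
print(f"value relative MSE: {value_err:.6f}")
```

The key path keeps much more precision than the value path, which matches the abstract's rationale: errors in keys pass through the softmax, whereas errors in values only perturb a weighted average.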
