Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
Joseph Kampeas, Emir Haleva

TL;DR
This paper introduces a joint encoding method for KV-cache blocks in large language models, significantly reducing memory usage and increasing inference throughput without hardware changes.
Contribution
It proposes a novel joint encoding technique that fuses similar KV-cache blocks, enabling scalable, high-concurrency LLM serving with theoretical and empirical validation.
Findings
Achieves up to 4.38× KV-cache compression with negligible accuracy loss.
Improves token throughput by approximately 40% in real LLM serving scenarios.
Outperforms recent compression baselines in diverse benchmarks.
Abstract
Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving standard cache structure. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware. Theoretically, we analyze the rate-distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38× KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks,…
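The block-fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the summary does not state the similarity measure, the fusion rule, or the block layout, so the cosine-similarity threshold, the running-mean merge, and the names `fuse_blocks` and `compression_ratio` are all assumptions made for illustration. Each block is a flat list of floats standing in for a flattened KV-cache block.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_blocks(blocks, threshold=0.95):
    """Greedily map each block to a shared representative (assumed rule).

    A block joins the most similar existing representative if the cosine
    similarity is at least `threshold`; otherwise it starts a new one.
    Returns (representatives, table), where table[i] is the index of the
    representative that block i now points to.
    """
    reps = []    # shared representations (running means of fused blocks)
    counts = []  # number of blocks fused into each representative
    table = []   # block index -> representative index
    for block in blocks:
        best, best_sim = -1, threshold
        for i, rep in enumerate(reps):
            sim = cosine(block, rep)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No representative is similar enough: keep the block as-is.
            reps.append(list(block))
            counts.append(1)
            table.append(len(reps) - 1)
        else:
            # Fuse: update the running mean of the chosen representative.
            n = counts[best]
            reps[best] = [(r * n + x) / (n + 1) for r, x in zip(reps[best], block)]
            counts[best] = n + 1
            table.append(best)
    return reps, table


def compression_ratio(blocks, reps):
    """Blocks stored before fusion divided by blocks stored after."""
    return len(blocks) / len(reps)


if __name__ == "__main__":
    blocks = [
        [1.0, 0.0, 0.0, 0.0],
        [0.99, 0.01, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.98, 0.02, 0.0],
    ]
    reps, table = fuse_blocks(blocks)
    print(table, compression_ratio(blocks, reps))
```

Because fused blocks are replaced by an indirection into a smaller set of same-shaped representatives, the per-block layout is unchanged, which is the property the abstract describes as preserving standard cache structure. Lowering `threshold` in this sketch fuses more aggressively, raising the compression ratio at the cost of more distortion.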
Peer Reviews
Decision · Submitted to ICLR 2026
The system's predictions are very accurate. The method's concept is simple but insightful. The layout-preserving design is a big plus. There is some theoretical support rather than it being purely heuristic. The experimental results are convincing in direction.
The presentation of actual system benefits is insufficient. The paper emphasizes scalable, high-throughput serving, but the key results are mostly offline F1 + CR curves. There are no clear end-to-end metrics, such as tokens/s, QPS, P95 latency, or the reduction in cross-machine KV migration traffic on actual vLLM / DistServe / Mooncake-like clusters. The paper even mentions that the current evaluation setup does not directly reflect acceleration effects, which somewhat mismatches the title. T
Strength: This paper presents a novel lossy KV cache compression method that effectively generalizes previous lossless approaches based on exact prefix matching. Given that the LLM decoding phase is intrinsically memory-bound, reducing the effective KV cache size per token—assuming the compression overhead remains negligible—represents a promising direction for fundamentally enhancing LLM inference performance. The experimental evaluation focuses on the fundamental rate-distortion trade-off be
Weakness: As articulated in the paper’s introduction and motivation, the ultimate objective is to improve the performance of LLM inference systems. One major concern, however, is the absence of experimental results demonstrating that the proposed KV cache joint-encoding scheme actually leads to improved system performance—in terms of throughput and/or latency. Moreover, the paper does not sufficiently justify that such performance gains can be reliably expected. The compression algorithm itself
Thank you for submitting this paper to ICLR! This paper targets a timely topic: KV cache compression for memory-efficient, long-context LLM inference. The introduced method can be used directly without modifying paged attention design of serving engines like vLLM, since it avoids operation on tensor / layer level of KV cache. I like your two complementary algorithms, which can cover both prefill and decode (most KV cache compression methods just cover prefill as a one-shot compression techniqu
This paper has three critical problems. It is very difficult to draw any scientific, grounded conclusions without resolving them: 1. The proposed method is in its essence a "lossy" approach of KV cache sharing, since similar (not identical!) blocks are fused. This would inevitably lead to accuracy degradation, which could be negligible or severe depending on the particular inference scenario. Furthermore, this could cause issues like positional encoding misalignment, etc., which is a critical
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Natural Language Processing Techniques
