Joint Encoding of KV-Cache Blocks for Scalable LLM Serving

Joseph Kampeas; Emir Haleva

arXiv:2601.03067·cs.LG·January 7, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces a joint encoding method for KV-cache blocks in large language models that compresses the cache by up to 4.38× and raises token throughput by roughly 40%, without hardware changes.

Contribution

It proposes a novel joint encoding technique that fuses similar KV-cache blocks, enabling scalable, high-concurrency LLM serving with theoretical and empirical validation.

Findings

01

Achieves up to 4.38× KV-cache compression with negligible accuracy loss.

02

Improves token throughput by approximately 40% in real LLM serving scenarios.

03

Outperforms recent compression baselines in diverse benchmarks.

Abstract

Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving standard cache structure. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware. Theoretically, we analyze the rate-distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to $4.38\times$ KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks,…
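The idea of fusing similar blocks into shared representations while keeping a standard block table can be illustrated with a minimal sketch. This is not the authors' algorithm: the cosine-similarity test, the `threshold` value, the running-mean update, and the names `fuse_blocks` and `compression_ratio` are all assumptions made for illustration, and each block is flattened to a plain list of floats.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened blocks (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_blocks(blocks, threshold=0.95):
    """Greedily map each block to a shared representative (hypothetical scheme).

    Returns (table, shared): table[i] is the index in `shared` that block i
    now points to, so the block-table layout stays unchanged while several
    entries may reference the same stored block.
    """
    shared = []  # fused representatives actually kept in memory
    counts = []  # number of original blocks merged into each representative
    table = []   # per-block pointer into `shared`
    for block in blocks:
        best, best_sim = -1, threshold
        for i, rep in enumerate(shared):
            sim = cosine(block, rep)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No representative is similar enough: store the block as-is.
            shared.append(list(block))
            counts.append(1)
            table.append(len(shared) - 1)
        else:
            # Fuse: update the representative as a running mean (assumed rule).
            n = counts[best]
            shared[best] = [(r * n + x) / (n + 1) for r, x in zip(shared[best], block)]
            counts[best] = n + 1
            table.append(best)
    return table, shared


def compression_ratio(table, shared):
    """Logical blocks divided by physically stored blocks."""
    return len(table) / len(shared)


blocks = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [1.0, 0.02]]
table, shared = fuse_blocks(blocks, threshold=0.95)
print(table, compression_ratio(table, shared))
```

In this toy input, three of the four blocks point in nearly the same direction and collapse into one representative, so four logical blocks are served from two stored ones. Lowering `threshold` trades accuracy for a higher ratio, which is the rate-distortion tradeoff the abstract refers to.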

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The system's predictions are very accurate. The method's concept is simple but insightful. The layout-preserving design is a big plus. There is some theoretical support rather than it being purely heuristic. The experimental results are convincing in direction.

Weaknesses

The presentation of actual system benefits is insufficient. The paper emphasizes scalable, high-throughput serving, but the key results are mostly offline F1 + CR curves. There are no clear end-to-end metrics, such as tokens/s, QPS, P95 latency, or the reduction in cross-machine KV migration traffic on actual vLLM / DistServe / Mooncake-like clusters. The paper even mentions that the current evaluation setup does not directly reflect acceleration effects, which somewhat mismatches the title. T

Reviewer 02 · Rating 2 · Confidence 4

Strengths

This paper presents a novel lossy KV cache compression method that effectively generalizes previous lossless approaches based on exact prefix matching. Given that the LLM decoding phase is intrinsically memory-bound, reducing the effective KV cache size per token (assuming the compression overhead remains negligible) represents a promising direction for fundamentally enhancing LLM inference performance. The experimental evaluation focuses on the fundamental rate-distortion trade-off be

Weaknesses

As articulated in the paper’s introduction and motivation, the ultimate objective is to improve the performance of LLM inference systems. One major concern, however, is the absence of experimental results demonstrating that the proposed KV cache joint-encoding scheme actually leads to improved system performance in terms of throughput and/or latency. Moreover, the paper does not sufficiently justify that such performance gains can be reliably expected. The compression algorithm itself

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Thank you for submitting this paper to ICLR! This paper targets a timely topic: KV cache compression for memory-efficient, long-context LLM inference. The introduced method can be used directly without modifying the paged attention design of serving engines like vLLM, since it avoids operating at the tensor/layer level of the KV cache. I like your two complementary algorithms, which can cover both prefill and decode (most KV cache compression methods just cover prefill as a one-shot compression techniqu

Weaknesses

This paper has three critical problems. It is very difficult to draw any scientific, grounded conclusions without resolving them: 1. The proposed method is in essence a "lossy" approach to KV cache sharing, since similar (not identical!) blocks are fused. This would inevitably lead to accuracy degradation, which could be negligible or severe depending on the particular inference scenario. Furthermore, this could cause issues like positional encoding misalignment, etc., which is a critical

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Natural Language Processing Techniques