ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Xiang Liu; Zhenheng Tang; Peijie Dong; Zeyu Li; Yue Liu; Bo Li; Xuming Hu; Xiaowen Chu

arXiv:2502.00299·cs.CL·October 15, 2025·2 cites

TL;DR

ChunkKV is a semantic-aware KV cache compression method for long-context LLM inference. It compresses the cache in units of semantic chunks so that linguistic structure is preserved, and it reports higher accuracy and throughput than existing compression methods.

Contribution

The paper redefines KV cache compression around semantic chunks rather than individual tokens, which keeps context and meaning intact, and it introduces a layer-wise index reuse technique that reduces computational overhead.
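The layer-wise index reuse idea can be illustrated with a minimal sketch. It assumes that the kept indices are computed once every `reuse_depth` layers and shared by the layers in between. The function name `indices_with_reuse`, the `reuse_depth` parameter, and the `select` callback are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List


def indices_with_reuse(
    num_layers: int,
    reuse_depth: int,
    select: Callable[[int], List[int]],
) -> Dict[int, List[int]]:
    """Map each layer to the token indices it keeps in its KV cache.

    `select(layer)` stands in for the (expensive) selection step.
    Assumption: it runs only on every `reuse_depth`-th layer, and the
    layers that follow reuse its result unchanged.
    """
    if num_layers < 0 or reuse_depth < 1:
        raise ValueError("num_layers must be >= 0 and reuse_depth >= 1")

    kept: Dict[int, List[int]] = {}
    current: List[int] = []
    for layer in range(num_layers):
        if layer % reuse_depth == 0:
            # Recompute the kept indices at the start of each group.
            current = select(layer)
        # Every layer in the group shares the same index list.
        kept[layer] = current
    return kept


if __name__ == "__main__":
    calls: List[int] = []

    def toy_select(layer: int) -> List[int]:
        # Hypothetical selector: records the call, returns dummy indices.
        calls.append(layer)
        return [layer, layer + 1]

    plan = indices_with_reuse(num_layers=8, reuse_depth=2, select=toy_select)
    print(len(calls), "selection calls for", len(plan), "layers")
```

With 8 layers and a reuse depth of 2, the selector in this sketch runs 4 times instead of 8. Reuse of this kind is only safe when neighbouring layers would pick similar indices, which is the cross-layer similarity the paper says chunk-level selection provides.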

Findings

1. Improves throughput by 26.5% with index reuse.
2. Outperforms state-of-the-art methods by up to 8.7% in precision.
3. Maintains high compression ratios while preserving semantic integrity.

Abstract

Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key-value (KV) cache consuming up to 70% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks, rather than isolated tokens, as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and…
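The contrast between token-level and chunk-level selection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's algorithm: chunks are taken to be fixed-size contiguous spans, a chunk's importance is the sum of its token scores, and the per-token `scores` list is a hypothetical input (the abstract does not say how importance is computed). The function names are also hypothetical.

```python
from typing import List


def token_keep_indices(scores: List[float], keep_tokens: int) -> List[int]:
    """Token-level baseline: keep the highest-scoring individual tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep_tokens])


def chunk_keep_indices(
    scores: List[float], chunk_size: int, keep_tokens: int
) -> List[int]:
    """Chunk-level selection: keep whole contiguous chunks.

    Assumptions (not from the paper): fixed-size chunks, and a chunk's
    score is the sum of its token scores.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    n = len(scores)
    # Contiguous spans; the last chunk may be shorter than chunk_size.
    chunks = [range(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    chunk_scores = [sum(scores[i] for i in c) for c in chunks]
    # Number of whole chunks that fit in the token budget (at least one).
    n_keep = max(1, keep_tokens // chunk_size)
    ranked = sorted(
        range(len(chunks)), key=lambda c: chunk_scores[c], reverse=True
    )
    return sorted(i for c in ranked[:n_keep] for i in chunks[c])


if __name__ == "__main__":
    # Hypothetical importance scores for a 12-token prompt.
    scores = [0.9, 0.1, 0.1, 0.1, 0.4, 0.5, 0.35, 0.3, 0.1, 0.8, 0.0, 0.0]
    print("token-level:", token_keep_indices(scores, keep_tokens=4))
    print("chunk-level:", chunk_keep_indices(scores, chunk_size=4, keep_tokens=4))
```

On this toy input the token-level baseline keeps positions 0, 4, 5, and 9, which are scattered across the prompt, while the chunk-level version keeps the contiguous span 4 to 7. That difference is the fragmentation problem the abstract describes.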

Taxonomy

Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Speech Recognition and Synthesis
