DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao; Qiang Huang; Yaowei Wang; Min Zhang; Jun Yu

arXiv:2602.08005·cs.CL·February 10, 2026

TL;DR

DeltaKV compresses the KV cache of long-context LLMs by storing residuals relative to similar earlier tokens rather than discarding tokens. It exploits long-range inter-token similarity and shared latent components to cut cache memory to 29% of its original size with near-lossless accuracy. A companion inference engine, Sparse-vLLM, turns those savings into up to 2× higher throughput.

Contribution

The paper presents DeltaKV, a novel residual-based KV cache compression framework that exploits long-range similarity, and Sparse-vLLM, an optimized inference engine, enabling scalable long-context LLM deployment.

Findings

01. Reduces KV cache memory to 29% of original size.

02. Maintains near-lossless accuracy on multiple benchmarks.

03. Achieves up to 2× throughput improvement with Sparse-vLLM.

Abstract

The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV…
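The idea of encoding a residual against a retrieved historical reference can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`compress`, `decompress`), the cosine-similarity retrieval rule, the `threshold` value, and the uniform quantization `step` are all assumptions made for illustration. Each vector is either kept in full as a reference or stored as small integer codes describing its difference from the most similar earlier reference.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def compress(vectors, threshold=0.95, step=0.05):
    """Hypothetical residual encoder: full references plus quantized residuals."""
    refs = {}    # position -> full-precision reference vector
    deltas = {}  # position -> (reference position, integer residual codes)
    for i, v in enumerate(vectors):
        # Retrieve the most similar earlier reference, if any exists.
        best = max(refs, key=lambda j: cosine(v, refs[j]), default=None)
        if best is None or cosine(v, refs[best]) < threshold:
            refs[i] = list(v)  # nothing similar enough: keep in full
        else:
            # Store only the residual, quantized to multiples of `step`.
            codes = [round((x - r) / step) for x, r in zip(v, refs[best])]
            deltas[i] = (best, codes)
    return refs, deltas

def decompress(refs, deltas, step=0.05):
    """Rebuild every vector from its reference and dequantized residual."""
    out = {i: list(v) for i, v in refs.items()}
    for i, (j, codes) in deltas.items():
        out[i] = [r + c * step for r, c in zip(refs[j], codes)]
    return [out[i] for i in sorted(out)]

# Toy "cache": positions 1 and 4 resemble position 0; position 3 resembles position 2.
cache = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [-3.0, 0.5, 1.0],
    [-2.9, 0.6, 1.1],
    [1.05, 1.95, 3.05],
]
refs, deltas = compress(cache)
restored = decompress(refs, deltas)
max_err = max(abs(x - y) for v, w in zip(cache, restored) for x, y in zip(v, w))
```

In this sketch the saving would come from storing short integer codes in place of full-precision vectors, and the per-element reconstruction error is bounded by half the quantization step. Note that position 4 retrieves position 0 as its reference even though they are not adjacent, which is the long-range aspect the abstract emphasizes.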

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Natural Language Processing Techniques