DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction

Yanqi Zhang; Yuwei Hu; Runyuan Zhao; John C.S. Lui; Haibo Chen

arXiv:2412.03131·cs.LG·September 3, 2025

TL;DR

DiffKV is a differentiated memory management framework for LLM serving. It compresses the KV cache by treating keys and values, individual tokens, and attention heads differently according to their importance and sparsity, which in turn raises serving throughput.

Contribution

The paper presents an on-GPU memory manager that compacts fragmented memory, paired with a compression scheme that exploits three levels of differentiation in the KV cache.
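To make the idea of compacting fragmented memory concrete, here is a minimal sketch of generic stream compaction using an exclusive prefix sum. It is not the paper's memory manager: the function name `compact_pages`, the use of `None` for a free slot, and the string page labels are all assumptions made for illustration. The point is that once each live page knows its destination index from the scan, every move is independent, which is why this pattern maps well to one GPU thread per slot.

```python
from itertools import accumulate
from typing import List, Optional


def compact_pages(pages: List[Optional[str]]) -> List[Optional[str]]:
    """Move live pages to the front of the pool, preserving their order.

    Hypothetical model: `None` marks a free slot, any string is a live page.
    """
    # 1 for a live slot, 0 for a free slot.
    live = [1 if p is not None else 0 for p in pages]

    # Exclusive prefix sum: dest[i] is the number of live pages before slot i,
    # which is exactly where slot i should land after compaction.
    dest = [0] + list(accumulate(live))[:-1]

    out: List[Optional[str]] = [None] * len(pages)
    # Each iteration touches a distinct destination, so on a GPU these
    # moves could run in parallel (assumption: one thread per slot).
    for i, page in enumerate(pages):
        if page is not None:
            out[dest[i]] = page
    return out


if __name__ == "__main__":
    fragmented = ["a", None, "b", None, None, "c"]
    print(compact_pages(fragmented))
```

A real on-GPU manager would also have to update the page tables that point at the moved pages; that bookkeeping is left out of this sketch.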

Findings

01 Achieves 2.7x to 5.7x KV cache compression with near-lossless accuracy.

02 Enhances throughput by 1.9x to 5.4x on various LLMs.

03 Effectively manages irregular memory patterns for scalable LLM serving.

Abstract

Large language models (LLMs) demonstrate remarkable capabilities but face substantial serving costs due to their high memory demands, with the key-value (KV) cache being a primary bottleneck. State-of-the-art KV cache compression techniques, such as quantization and pruning, apply uniform treatment to both keys and values, and discard unimportant tokens entirely, overlooking the fine-grained distinctions in the significance of individual KV cache components. To address such limitations, we introduce DiffKV, a novel framework for efficient KV cache compression that exploits three levels of differentiation in the KV cache: (1) the differing impact of keys and values on attention computation, (2) the varying importance of tokens, and (3) the diverse dynamic sparsity patterns across attention heads. These levels of differentiation introduce irregular memory usage patterns across…
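The three levels of differentiation named in the abstract can be illustrated with a small sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the tier names, the bit widths (keys stored at higher precision than values), the importance scores, and the thresholds. The sketch assigns each token a precision tier from its score, then shows that two heads with different score distributions end up with different memory footprints, which is the kind of irregular usage the abstract refers to.

```python
from dataclasses import dataclass
from typing import List

FP16_BITS = 16  # uncompressed baseline: 16 bits each for key and value


@dataclass(frozen=True)
class Tier:
    name: str
    key_bits: int
    value_bits: int


# Hypothetical tiers. Level 1: keys get more bits than values.
HIGH = Tier("high", key_bits=8, value_bits=4)
LOW = Tier("low", key_bits=4, value_bits=2)
PRUNED = Tier("pruned", key_bits=0, value_bits=0)


def assign_tiers(scores: List[float], hi: float, lo: float) -> List[Tier]:
    """Level 2: pick a tier per token from its importance score."""
    tiers = []
    for s in scores:
        if s >= hi:
            tiers.append(HIGH)
        elif s >= lo:
            tiers.append(LOW)
        else:
            tiers.append(PRUNED)
    return tiers


def kv_bits(tiers: List[Tier]) -> int:
    """Total bits stored per channel for one head's tokens."""
    return sum(t.key_bits + t.value_bits for t in tiers)


def compression_ratio(tiers: List[Tier]) -> float:
    """Baseline FP16 bits divided by bits actually stored."""
    used = kv_bits(tiers)
    baseline = len(tiers) * 2 * FP16_BITS
    return baseline / used if used else float("inf")


if __name__ == "__main__":
    # Level 3: two heads with different (made-up) sparsity patterns.
    head_a = [0.40, 0.30, 0.20, 0.10]  # attention spread over tokens
    head_b = [0.85, 0.05, 0.05, 0.05]  # attention concentrated on one token
    for name, scores in (("head_a", head_a), ("head_b", head_b)):
        tiers = assign_tiers(scores, hi=0.25, lo=0.08)
        print(name, [t.name for t in tiers], kv_bits(tiers),
              round(compression_ratio(tiers), 2))
```

Because each head keeps a different number of tokens at each precision, per-head allocations differ in size and change as generation proceeds; a fixed-size, uniform layout no longer fits.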

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Pruning