DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen

TL;DR
DiffKV introduces a differentiated memory management framework for large language models that improves KV cache compression and throughput by exploiting three kinds of variation in the KV cache: between keys and values, among tokens of differing importance, and across the dynamic sparsity patterns of attention heads.
Contribution
It presents a novel on-GPU memory manager that compacts fragmented memory and leverages three levels of differentiation in KV caches for efficient compression.
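As a rough illustration of what compacting fragmented memory can look like, the sketch below uses an exclusive prefix sum over liveness flags to give every surviving page an independent destination slot, which is the usual way to make compaction parallel-friendly. The function name `compact_pages`, the page representation, and the prefix-sum approach are assumptions made for illustration; the summary does not describe DiffKV's actual compaction algorithm.

```python
from itertools import accumulate

def compact_pages(pages, live):
    """Move live pages to the front, preserving order.

    Hypothetical helper: `pages` is a list of page payloads and `live`
    is a parallel list of booleans marking pages still in use.
    Returns (compacted_pages, number_of_freed_slots).
    """
    flags = [1 if f else 0 for f in live]
    # Exclusive prefix sum: dest[i] = number of live pages before index i.
    dest = [0] + list(accumulate(flags))[:-1]
    n_live = sum(flags)
    out = [None] * len(pages)
    for i, page in enumerate(pages):
        if live[i]:
            # Each i writes to a distinct slot, so on a GPU every
            # iteration could be handled by an independent thread.
            out[dest[i]] = page
    return out[:n_live], len(pages) - n_live

compacted, freed = compact_pages(["a", "b", "c", "d"], [True, False, True, True])
```

Because the destination indices are computed before any page moves, no two writes collide, which is what lets the copy step run in parallel.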
Findings
Achieves 2.7x to 5.7x KV cache compression with near-lossless accuracy.
Enhances throughput by 1.9x to 5.4x on various LLMs.
Effectively manages irregular memory patterns for scalable LLM serving.
Abstract
Large language models (LLMs) demonstrate remarkable capabilities but face substantial serving costs due to their high memory demands, with the key-value (KV) cache being a primary bottleneck. State-of-the-art KV cache compression techniques, such as quantization and pruning, apply uniform treatment to both keys and values, and discard unimportant tokens entirely, overlooking the fine-grained distinctions in the significance of individual KV cache components. To address such limitations, we introduce DiffKV, a novel framework for efficient KV cache compression that exploits three levels of differentiation in the KV cache: (1) the differing impact of keys and values on attention computation, (2) the varying importance of tokens, and (3) the diverse dynamic sparsity patterns across attention heads. These levels of differentiation introduce irregular memory usage patterns across…
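The first two levels of differentiation can be pictured with a small sketch: keys and values are quantized at different bit widths, and each token is placed in a precision tier (or pruned) according to an importance score. Everything specific here is an assumption for illustration only: the bit widths (8/4 and 4/2), the thresholds, the min-max quantizer, and the names `quantize` and `compress_token` are not taken from the paper.

```python
def quantize(vec, bits):
    """Uniform min-max quantization to `bits` bits; returns dequantized floats."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)  # constant vector: nothing to quantize
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]

def compress_token(key, value, score, hi_thresh=0.1, lo_thresh=0.01):
    """Pick a precision tier from a token's importance score.

    Hypothetical policy: important tokens keep more bits, keys keep
    more bits than values, and the least important tokens are pruned.
    Returns (key, value) after quantization, or None if pruned.
    """
    if score >= hi_thresh:
        return quantize(key, 8), quantize(value, 4)
    if score >= lo_thresh:
        return quantize(key, 4), quantize(value, 2)
    return None

kept = compress_token([0.0, 0.3, 1.0], [0.0, 0.3, 1.0], score=0.5)
pruned = compress_token([0.0, 0.3, 1.0], [0.0, 0.3, 1.0], score=0.001)
```

If such a policy were applied independently per attention head, different heads would retain different numbers of tokens at different precisions, which is one way to picture the irregular memory usage the abstract mentions.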
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Pruning
