PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
Bo Jiang, Taolue Yang, Youyuan Liu, Xubin He, Sheng Di, Sian Jin

TL;DR
PackKV introduces a lossy compression framework for KV caches in large language models, significantly reducing memory footprint and boosting inference throughput without substantial accuracy loss.
Contribution
The paper presents a novel lossy compression technique tailored for KV caches in LLMs, enabling efficient long-context inference with high memory savings and throughput.
Findings
Achieves 153.2% higher memory reduction for K cache
Achieves 179.6% higher memory reduction for V cache
Improves throughput by 75.7% for K and 171.7% for V caches
Abstract
Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements of the key-value (KV) cache, which can scale to several gigabytes as sequence length and batch size increase. In this paper, we present PackKV, a generic and efficient KV cache management framework optimized for long-context generation. PackKV introduces novel lossy compression techniques specifically tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach is compatible with the dynamically growing nature of the KV cache while preserving high computational…
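To make the abstract's remark about the KV cache scaling to several gigabytes concrete, the sketch below computes the uncompressed cache size from the standard accounting: one K and one V tensor per layer, each holding one vector per token per KV head. The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit elements) is an illustrative assumption and is not taken from the paper, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Uncompressed KV cache size in bytes.

    The leading factor of 2 accounts for storing both a K tensor and a
    V tensor in every layer. bytes_per_elem=2 assumes FP16/BF16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
single = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=32768, batch_size=1)
batched = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=32768, batch_size=4)

GIB = 2 ** 30
print(f"per token:            {per_token / 1024:.0f} KiB")
print(f"32K tokens, batch 1:  {single / GIB:.0f} GiB")
print(f"32K tokens, batch 4:  {batched / GIB:.0f} GiB")
```

Under these assumed dimensions the cache grows linearly in both sequence length and batch size, which is why any reduction in bytes per cached element translates directly into longer contexts or larger batches for the same memory budget.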
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy
