PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression

Bo Jiang; Taolue Yang; Youyuan Liu; Xubin He; Sheng Di; Sian Jin

arXiv:2512.24449 · cs.DC · January 9, 2026

TL;DR

PackKV introduces a lossy compression framework for KV caches in large language models, significantly reducing memory footprint and boosting inference throughput without substantial accuracy loss.

Contribution

The paper presents a novel lossy compression technique tailored for KV caches in LLMs, enabling efficient long-context inference with high memory savings and throughput.

Findings

1. Achieves 153.2% higher memory reduction for the K cache
2. Achieves 179.6% higher memory reduction for the V cache
3. Improves throughput by 75.7% for the K cache and 171.7% for the V cache

Abstract

Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements of the key-value (KV) cache, which can scale to several gigabytes as sequence length and batch size increase. In this paper, we present PackKV, a generic and efficient KV cache management framework optimized for long-context generation. PackKV introduces novel lossy compression techniques specifically tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach is compatible with the dynamically growing nature of the KV cache while preserving high computational…
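To make the "several gigabytes" claim concrete, the uncompressed KV cache size can be estimated from the model shape. The sketch below is not taken from the paper: the function name `kv_cache_bytes` and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit elements) are assumptions chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    """Estimate the uncompressed KV cache size in bytes.

    Each token stores one key vector and one value vector (hence the
    factor of 2) per KV head in every layer.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical model shape, assumed for illustration only.
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

    for batch_size in (1, 8):
        size = kv_cache_bytes(seq_len=4096, batch_size=batch_size, **shape)
        print(f"batch={batch_size}: {size / 2**30:.1f} GiB")
    # batch=1: 2.0 GiB
    # batch=8: 16.0 GiB
```

Under these assumed dimensions the cache costs 0.5 MiB per token, so it grows linearly in both sequence length and batch size. That linear growth is the footprint a KV cache compressor aims to shrink.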

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy