KV Cache Compression for Inference Efficiency in LLMs: A Review

Yanyu Liu (1), Jingying Fu (1), Sixiang Liu (1), Yitian Zou (1), You Fu (1), Jiehan Zhou (1), Shouhua Zhang (2) ((1) Shandong University of Science and Technology; (2) University of Oulu)

arXiv:2508.06297·cs.DC·August 11, 2025

Open Access

TL;DR

This review analyzes various KV cache compression techniques for large language models, highlighting their effectiveness, limitations, and future research directions to improve inference efficiency and scalability.

Contribution

It provides a comprehensive evaluation of current KV cache optimization methods, including their trade-offs and application scenarios, and discusses future research directions.

Findings

01. Compression methods reduce memory usage during inference.

02. Trade-offs exist between compression efficiency and model accuracy.

03. Future directions include hybrid and adaptive optimization strategies.

Abstract

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant memory bottleneck, limiting the inference efficiency and scalability of the models. Therefore, optimizing the KV cache during inference is crucial for enhancing performance and efficiency. This review systematically examines current KV cache optimization techniques, including compression strategies such as selective token strategies, quantization, and attention compression. We evaluate the effectiveness, trade-offs, and application scenarios of these methods, providing a comprehensive analysis of their impact on memory usage and inference speed. We focus on identifying the limitations and challenges of existing methods, such as compatibility issues with…
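
To make the memory bottleneck described in the abstract concrete, the following is a minimal back-of-the-envelope sketch of how KV cache size scales with context length and how quantization changes it. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not values taken from the paper, and the estimate ignores overheads such as quantization scales or allocator padding.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Rough KV cache size in bytes for a decoder-only transformer.

    Assumes one key vector and one value vector are stored per layer,
    per KV head, per token. Ignores quantization metadata and padding.
    """
    # The leading factor of 2 accounts for the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
fp16_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)            # 16-bit values
fp16_8k = kv_cache_bytes(32, 32, 128, seq_len=8192)            # doubled context
int4_4k = kv_cache_bytes(32, 32, 128, seq_len=4096,
                         bytes_per_value=0.5)                  # 4-bit values

print(f"FP16, 4k tokens: {fp16_4k / 1024**3:.2f} GiB")
print(f"FP16, 8k tokens: {fp16_8k / 1024**3:.2f} GiB")
print(f"INT4, 4k tokens: {int4_4k / 1024**3:.2f} GiB")
```

Under these assumptions, the cache for a single sequence grows in direct proportion to the number of cached tokens, which is why selective token strategies (fewer cached tokens) and quantization (fewer bytes per value) both reduce the footprint.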

Taxonomy

Topics: Advanced Data Storage Technologies · Algorithms and Data Compression