KV Cache Compression for Inference Efficiency in LLMs: A Review
Yanyu Liu (1), Jingying Fu (1), Sixiang Liu (1), Yitian Zou (1), You Fu (1), Jiehan Zhou (1), Shouhua Zhang (2) ((1) Shandong University of Science and Technology, (2) University of Oulu)

TL;DR
This review analyzes various KV cache compression techniques for large language models, highlighting their effectiveness, limitations, and future research directions to improve inference efficiency and scalability.
Contribution
It provides a comprehensive evaluation of current KV cache optimization methods, including their trade-offs and application scenarios, and discusses future research directions.
Findings
Compression methods reduce memory usage during inference.
Trade-offs exist between compression efficiency and model accuracy.
Future directions include hybrid and adaptive optimization strategies.
Abstract
With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant memory bottleneck, limiting the inference efficiency and scalability of the models. Therefore, optimizing the KV cache during inference is crucial for enhancing performance and efficiency. This review systematically examines current KV cache optimization techniques, including compression strategies such as selective token strategies, quantization, and attention compression. We evaluate the effectiveness, trade-offs, and application scenarios of these methods, providing a comprehensive analysis of their impact on memory usage and inference speed. We focus on identifying the limitations and challenges of existing methods, such as compatibility issues with…
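To make the memory bottleneck concrete, the following is a rough back-of-the-envelope sketch, not a method taken from the paper. It assumes a standard decoder that stores one key and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model shape used below (32 layers, 32 KV heads, head dimension 128) are hypothetical, illustrative choices.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16; 1 corresponds to an 8-bit
    format (ignoring any per-group scale/zero-point overhead).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative model shape (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

fp16_4k = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096)
int8_4k = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096,
                         bytes_per_elem=1)
# Selective token strategy, modeled crudely as keeping half of the tokens.
fp16_half_tokens = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=2048)

GIB = 1024 ** 3
print(f"fp16, 4096 tokens:      {fp16_4k / GIB:.2f} GiB")
print(f"8-bit, 4096 tokens:     {int8_4k / GIB:.2f} GiB")
print(f"fp16, 2048 kept tokens: {fp16_half_tokens / GIB:.2f} GiB")
```

Under this simple formula the footprint is proportional to sequence length and batch size, so halving either the bytes per element (quantization) or the number of retained tokens (token selection) halves the cache. Real systems add overheads such as quantization metadata and allocator fragmentation that this sketch ignores.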
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression
