A Survey on Large Language Model Acceleration based on KV Cache Management
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen

TL;DR
This survey reviews strategies for managing the KV cache in large language models to improve inference speed and memory efficiency, covering token-level, model-level, and system-level techniques as well as benchmarks.
Contribution
It provides a comprehensive taxonomy and comparative analysis of KV cache management methods, aiding future research and practical deployment of LLMs.
Findings
Token-level strategies improve cache efficiency and reduce redundancy.
Model-level innovations change the model architecture to enhance KV reuse.
System-level optimizations improve memory and hardware utilization.
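As a rough illustration of the cache that these strategies manage, the sketch below compares single-head attention decoding with and without a KV cache. It is not taken from the survey: `toy_project`, `KVCache`, and both decode functions are hypothetical names, and the projection is a stand-in for learned weight matrices. It shows only that caching keys and values gives the same outputs while avoiding re-projecting the whole prefix at every step.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for one query over all stored keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def toy_project(x):
    """Hypothetical stand-in for the learned Q/K/V projections of one token."""
    q = list(x)
    k = [2.0 * a for a in x]
    v = [a + 1.0 for a in x]
    return q, k, v


class KVCache:
    """Append-only store of the keys and values of tokens seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_with_cache(tokens, project):
    """Project each token once and reuse its key/value at every later step."""
    cache = KVCache()
    outputs = []
    for x in tokens:
        q, k, v = project(x)
        cache.append(k, v)
        outputs.append(attend(q, cache.keys, cache.values))
    return outputs


def decode_without_cache(tokens, project):
    """Re-project the whole prefix at every step (the redundant baseline)."""
    outputs = []
    for t in range(len(tokens)):
        projected = [project(x) for x in tokens[: t + 1]]
        q = projected[-1][0]
        keys = [p[1] for p in projected]
        values = [p[2] for p in projected]
        outputs.append(attend(q, keys, values))
    return outputs
```

For n tokens, the uncached version calls `project` n(n+1)/2 times and the cached version n times. The price is a cache that grows with sequence length, which is the memory cost that selection, merging, and quantization methods aim to reduce.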
Abstract
Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition,…
Taxonomy
Topics: Network Packet Processing and Optimization · Caching and Content Delivery · Recommender Systems and Techniques
Methods: Softmax · Attention Is All You Need · Focus
