A Survey on Large Language Model Acceleration based on KV Cache Management

Haoyang Li; Yiming Li; Anxin Tian; Tianhao Tang; Zhanchao Xu; Xuejia Chen; Nicole Hu; Wei Dong; Qing Li; Lei Chen

arXiv:2412.19442 · cs.AI · July 31, 2025 · 3 cites

TL;DR

This survey reviews various strategies for managing KV cache in large language models to improve inference speed and memory efficiency, covering token, model, and system-level techniques and benchmarks.

Contribution

It provides a comprehensive taxonomy and comparative analysis of KV cache management methods, aiding future research and practical deployment of LLMs.

Findings

1. Token-level strategies (KV cache selection, budget allocation, merging, quantization, and low-rank decomposition) improve cache efficiency and reduce redundancy.

2. Model-level methods use architectural changes to increase KV reuse.

3. System-level optimizations improve memory management and hardware utilization.

Abstract

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition,…
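The redundancy that a KV cache removes can be shown with a minimal sketch. It is a toy single-head decoder in plain Python, not code from the survey or its repository. The names `KVCache`, `project`, `decode_with_cache`, and `decode_without_cache` are hypothetical, the "projections" are fixed elementwise scalings standing in for learned weight matrices, and each token vector serves as its own query.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    # Single-head scaled dot-product attention for one query vector.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def project(x, scale):
    # Assumption: a stand-in for a learned K or V projection.
    return [scale * xi for xi in x]


class KVCache:
    # Hypothetical minimal cache: stores one key and one value per past token.
    def __init__(self):
        self.keys = []
        self.values = []
        self.projections = 0  # number of tokens projected into K/V so far

    def append(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1


def decode_with_cache(tokens):
    cache = KVCache()
    outputs = []
    for x in tokens:
        cache.append(x)  # project only the newest token
        outputs.append(attend(x, cache.keys, cache.values))
    return outputs, cache.projections


def decode_without_cache(tokens):
    outputs = []
    projections = 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        # Re-project the whole prefix at every step (the redundant work).
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        projections += len(prefix)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cached_out, cached_cost = decode_with_cache(tokens)
naive_out, naive_cost = decode_without_cache(tokens)
print(cached_cost, naive_cost)
```

Both paths produce the same attention outputs. With the cache, the number of K/V projections grows linearly with sequence length; without it, the number grows quadratically. The cost of the cache is memory that grows with every decoded token, which is the pressure the surveyed management strategies aim to relieve.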

Code & Models

Repositories

treeai-lab/awesome-kv-cache-management (PyTorch, official)

Taxonomy

Topics: Network Packet Processing and Optimization · Caching and Content Delivery · Recommender Systems and Techniques

Methods: Softmax · Attention Is All You Need · Focus