KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
Yichun Xu, Navjot K. Khaira, Tejinder Singh

TL;DR
This paper systematically reviews and categorizes KV cache optimization techniques for large language models, targeting the memory and throughput bottlenecks that limit scalable inference across diverse deployment scenarios.
Contribution
It provides a comprehensive analysis of five main KV cache optimization strategies, their trade-offs, and practical deployment guidance for scalable LLM inference.
Findings
No single technique is best for all scenarios.
Adaptive, multi-stage optimization is promising for future work.
Techniques are evaluated empirically across multiple deployment scenarios.
Abstract
The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint scales linearly with context length, imposing critical bottlenecks on GPU memory capacity, memory bandwidth, and inference throughput as production LLMs push context windows from thousands to millions of tokens. Efficient KV cache management has thus become a first-order challenge for scalable LLM deployment. This paper provides a systematic review of recent KV cache optimization techniques, organizing them into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. For each category we analyze the underlying mechanisms, deployment trade-offs, and empirical…
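The linear scaling described in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a standard attention layout in which every layer stores one key tensor and one value tensor per token. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit elements) are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one forward context.

    Assumes each layer caches a key and a value vector of size
    n_kv_heads * head_dim for every token (hypothetical helper).
    """
    # Factor of 2: one tensor for keys, one for values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model configuration, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for seq_len in (4_096, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:7.1f} GiB")
```

Under these assumptions each token costs 0.5 MiB of cache, so a 4,096-token context needs about 2 GiB while a context of roughly one million tokens needs about 512 GiB for a single sequence. Reducing `n_kv_heads` or `bytes_per_elem` in the sketch shrinks the per-token cost, while shortening the effective `seq_len` shrinks the token count; these are the two levers that compression and eviction techniques act on.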
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
