KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Yichun Xu; Navjot K. Khaira; Tejinder Singh

arXiv:2603.20397·cs.LG·March 24, 2026

Open Access

TL;DR

This paper systematically reviews and categorizes various KV cache optimization techniques for large language models, addressing memory and throughput bottlenecks in scalable inference across diverse deployment scenarios.

Contribution

It provides a comprehensive analysis of five main KV cache optimization strategies, their trade-offs, and practical deployment guidance for scalable LLM inference.

Findings

01. No single technique is best for all scenarios.

02. Adaptive, multi-stage optimization is promising for future work.

03. Empirical evaluation across multiple deployment scenarios.

Abstract

The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint scales linearly with context length, imposing critical bottlenecks on GPU memory capacity, memory bandwidth, and inference throughput as production LLMs push context windows from thousands to millions of tokens. Efficient KV cache management has thus become a first-order challenge for scalable LLM deployment. This paper provides a systematic review of recent KV cache optimization techniques, organizing them into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. For each category we analyze the underlying mechanisms, deployment trade-offs, and empirical…
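
The linear scaling described in the abstract can be made concrete with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes a standard decoder-only Transformer that stores one key tensor and one value tensor per layer, and the function name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for all layers.

    Assumes every layer stores one key and one value vector of size
    num_kv_heads * head_dim per cached token (no eviction or compression).
    """
    # Factor of 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration, for illustration only.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4_096, 32_768, 1_048_576):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:8.1f} GiB")
```

Under these assumed dimensions each cached token costs 0.5 MiB, so a 4,096-token context needs about 2 GiB and a context of roughly one million tokens needs about 512 GiB per sequence, which is why eviction, compression, and hybrid memory schemes become necessary at long context lengths.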

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications