Comparative Characterization of KV Cache Management Strategies for LLM Inference
Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

TL;DR
This paper empirically compares three KV cache management frameworks for LLM inference, analyzing their trade-offs in memory and performance across various conditions.
Contribution
It provides a detailed evaluation of vLLM, InfiniGen, and H2O, revealing their strengths and optimal use cases for KV cache strategies.
Findings
Different frameworks excel under varying request rates and model sizes.
Memory consumption and latency trade-offs depend on cache management techniques.
The study offers guidelines for selecting a suitable KV cache strategy based on system constraints.
Abstract
Efficient inference with Large Language Models (LLMs) increasingly relies on Key-Value (KV) caches to store previously computed key and value vectors at each layer. These caches are essential to minimize redundant computation during autoregressive token generation, lowering computational complexity from quadratic to linear. However, the growth of KV caches has posed significant system-level challenges, particularly as model sizes increase, context lengths grow, and concurrent requests compete for limited memory resources. Even though several recent frameworks for KV cache management have emerged, their comparative trade-offs in memory consumption and inference performance have not been fully understood, especially under varying request sizes and model configurations. In this work, we conduct an empirical study of three state-of-the-art KV cache management frameworks: vLLM, InfiniGen,…
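To make the memory pressure described in the abstract concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with model depth, context length, and the number of concurrent requests. It is not taken from the paper: the function name `kv_cache_bytes`, its parameters, and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, and the estimate ignores allocator overhead and any compression or eviction a given framework applies.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Rough KV cache size in bytes for a decoder-only transformer.

    Hypothetical helper for illustration only; not an API of any framework.
    """
    # Factor 2: one key vector and one value vector are stored
    # per token, per layer, per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed configuration: 32 layers, 32 KV heads, head_dim 128, 16-bit values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(per_token // 2**10, "KiB per token")
print(total // 2**30, "GiB for 8 concurrent requests of 4096 tokens each")
```

Under these assumptions the cache grows linearly in both context length and the number of concurrent requests, which is why management strategies that page, offload, or evict cache entries matter once requests compete for GPU memory.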
