Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Yue Zhu; Hao Yu; Chen Wang; Zhuoran Liu; Eun Kyung Lee

arXiv:2505.21919 · cs.ET · May 29, 2025

TL;DR

This paper analyzes real-world key-value cache access patterns in large language model inference workloads, highlighting the need for specialized, efficient distributed caching systems to optimize performance and reduce redundancy.

Contribution

It provides an in-depth analysis of KVC access patterns and evaluates existing storage solutions, emphasizing the necessity for tailored caching systems for LLM inference.

Findings

01. Existing key-value stores are not optimized for KVC prefix prefilling in LLM inference.

02. RAG and agent workloads exhibit high cache reusability, which makes efficient caching critical to reducing redundant computation.

03. An efficient distributed cache with optimized metadata management can improve inference scalability and latency.

Abstract

The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads such as Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores such as Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
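
As a rough illustration of what "KVC metadata management" for prefix prefilling can involve, the sketch below indexes cached KV blocks by a chained hash over fixed-size token blocks and looks up the longest cached prefix of a new request. It is not the paper's design: the block size, the `PrefixIndex` class, the `block_keys` helper, and the location strings are all hypothetical, and an in-memory dict stands in for a metadata store such as Redis or an RDMA-based index.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 4  # hypothetical number of tokens per cached KV block


def block_keys(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash key per full block of tokens.

    Each key depends on the previous key, so a key identifies the whole
    prefix up to and including its block, not just the block's own tokens.
    """
    keys: List[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class PrefixIndex:
    """Toy metadata index: prefix key -> location of the cached KV block."""

    def __init__(self) -> None:
        # Assumption: a plain dict stands in for the distributed metadata store.
        self._meta: Dict[str, str] = {}

    def insert(self, tokens: Sequence[int], location_prefix: str) -> None:
        """Record where the KV block for each full prefix block is stored."""
        for i, key in enumerate(block_keys(tokens)):
            self._meta.setdefault(key, f"{location_prefix}/block{i}")

    def longest_prefix(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens already have cached KV blocks."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self._meta:
                break
            hit += BLOCK_SIZE
        return hit


index = PrefixIndex()
index.insert([1, 2, 3, 4, 5, 6, 7, 8], "node0")
reused = index.longest_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

In this sketch a request that shares its first eight tokens with a cached sequence can skip prefilling those tokens, and each lookup costs one metadata read per block, which is why the latency of the metadata store matters for this kind of workload.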

Taxonomy

Topics: Natural Language Processing Techniques · Data Quality and Management · Topic Modeling