KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

Jian Lin; Jiazhi Mi; Zicong Hong; Haodong Wang; Qianli Liu; Haodyue Zhang; Peng Li; Song Guo

arXiv:2605.18071·cs.CL·May 19, 2026

TL;DR

KVDrive is a multi-tier KV cache management system for long-context LLM inference. It jointly orchestrates cache placement, pipeline scheduling, and cross-tier coordination across GPU memory, host DRAM, and SSD, achieving up to 1.74x higher throughput than state-of-the-art systems.

Contribution

KVDrive introduces a systems-level approach to managing the KV cache and its data movement across GPU memory, host DRAM, and SSD for scalable long-context inference.

Findings

01 Achieves up to 1.74x higher throughput than state-of-the-art systems.

02 Manages the cache to maximize reuse and reduce data movement.

03 Eliminates stalls by restructuring the decoding pipeline.

Abstract

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective, jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances…
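The scaling argument in the abstract can be made concrete with a back-of-the-envelope estimate. The sketch below is not taken from the paper: the model shape (32 layers, 8 KV heads, head dimension 128, fp16), the fixed fetch fraction, and the function names are all illustrative assumptions. It also assumes every selected entry is fetched from host memory at each decoding step with no reuse on the GPU, which is the worst case that a cache management system would try to avoid.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache entry for one token, in bytes.

    The factor 2 accounts for storing one key and one value vector
    per layer and per KV head. bytes_per_elem=2 corresponds to fp16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_bytes_per_step(context_len, batch_size, fetch_fraction, bytes_per_token):
    """Host-to-GPU KV traffic for one decoding step under a simple model.

    Assumption (not from the paper): each sequence fetches the same
    fraction of its cached tokens, and nothing is reused from GPU memory.
    """
    fetched_tokens = int(context_len * fetch_fraction) * batch_size
    return fetched_tokens * bytes_per_token


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

    # With the fetch fraction held fixed (the sparsity "ceiling"),
    # traffic grows linearly in both context length and batch size.
    for context_len in (32_768, 131_072):
        for batch_size in (1, 8):
            moved = transfer_bytes_per_step(context_len, batch_size, 0.05, per_token)
            print(f"context={context_len:>7} batch={batch_size} "
                  f"transfer per step = {moved / 2**20:.1f} MiB")
```

Because the fetch fraction cannot shrink without hurting accuracy, the only remaining levers in this simple model are how often fetched entries can be reused and how well transfers overlap with computation, which is the systems-level angle the abstract describes.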
