KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

TL;DR
KVDrive is a multi-tier KV cache management system for long-context LLM inference. It orchestrates cache placement, pipeline scheduling, and cross-tier coordination, improving throughput by up to 1.74x over the state of the art.
Contribution
It introduces a systems-level approach to manage cache and data movement across GPU, host DRAM, and SSD for scalable long-context inference.
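To make the idea of a three-tier hierarchy concrete, here is a minimal sketch of a tiered block cache with plain LRU demotion and promote-on-access. It is not KVDrive's placement policy, which this summary does not describe. The class name `TieredCache`, the tier names, and the capacities are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache: LRU demotion downward, promotion to the top on access.

    Illustrative assumption only; not the policy used by KVDrive.
    """

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        # max_blocks=None means the tier is treated as unbounded.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used in this tier
        if cap is not None and len(store) > cap:
            # Evict the least recently used block and demote it one tier down.
            old_key, old_value = store.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)
            # If there is no lower tier, the block is dropped.

    def put(self, key, value):
        # Remove any stale copy first so a block lives in exactly one tier.
        for _, _, store in self.tiers:
            store.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_name_where_found), promoting the block to tier 0."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)
                return value, name
        return None, None

    def tier_of(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


# Hypothetical capacities: 2 blocks on GPU, 2 in host DRAM, unbounded SSD.
cache = TieredCache([("gpu", 2), ("dram", 2), ("ssd", None)])
for i in range(5):
    cache.put(f"block{i}", f"kv{i}")
```

After the five inserts above, the oldest block has been demoted all the way to the SSD tier; reading it back with `cache.get("block0")` promotes it to the GPU tier and pushes colder blocks down. A real system would also have to account for transfer cost and overlap with computation, which this sketch ignores.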
Findings
Achieves up to 1.74x higher throughput than the state of the art.
Effectively manages cache to maximize reuse and reduce data movement.
Eliminates stalls by restructuring the decoding pipeline.
Abstract
Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective, jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances…
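The scaling argument in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head) and assumes a worst case in which every selected entry is fetched from host memory at each decoding step. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 1/32 selection ratio are hypothetical values, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for num_tokens tokens per sequence."""
    # Leading factor 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * batch_size * bytes_per_elem)


def per_step_fetch_bytes(num_layers, num_kv_heads, head_dim, context_len,
                         batch_size, keep_one_in=32, bytes_per_elem=2):
    """Worst-case bytes moved per decoding step under a fixed sparsity ratio.

    Assumes 1 out of every `keep_one_in` cached tokens is selected as critical
    and that none of the selected entries is already resident on the GPU.
    """
    selected_tokens = context_len // keep_one_in
    return kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                          selected_tokens, batch_size, bytes_per_elem)


GIB = 2 ** 30
MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # hypothetical shape

# Full cache for a single 128K-token sequence.
full_cache = kv_cache_bytes(num_tokens=131072, batch_size=1, **MODEL)

# Per-step transfer at two operating points with the same sparsity ratio.
small = per_step_fetch_bytes(context_len=32768, batch_size=1, **MODEL)
large = per_step_fetch_bytes(context_len=131072, batch_size=8, **MODEL)

print(f"full cache, 128K tokens, batch 1: {full_cache / GIB:.1f} GiB")
print(f"per-step fetch grows by a factor of {large // small}")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of KV cache, and growing the context 4x and the batch 8x multiplies the per-step transfer by 32 even though the sparsity ratio is unchanged. This is the linear growth that a fixed sparsity ceiling cannot offset.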
