Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Shi Qiu, Yifan Hu, Xintao Wang, Wenhao Zhu, Jianqin Yan, Hao Chen, Kaiqiang Xu, Kai Chen, Yiming Zhang

TL;DR
Tutti is a GPU-centric SSD-backed key-value cache system that significantly reduces GPU stalls and improves large language model serving performance by eliminating CPU bottlenecks in cache management.
Contribution
Tutti introduces a GPU-native object store, GPU io_uring support, and slack-aware I/O scheduling to optimize SSD-backed KV caching for LLM serving.
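This summary does not say how Tutti defines or uses slack, so the following is only a generic illustration of one common reading of "slack-aware" scheduling: issue the request with the least slack first, where slack is the time left before a deadline after subtracting the estimated transfer time. The names `IORequest`, `slack_us`, `least_slack_first`, and the fixed-bandwidth cost model are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IORequest:
    """Hypothetical KV cache read request (not Tutti's actual data structure)."""
    name: str
    size_bytes: int
    deadline_us: float  # time by which the data is assumed to be needed in HBM


def slack_us(req: IORequest, now_us: float, bytes_per_us: float) -> float:
    # Slack = time until the deadline minus the estimated transfer time.
    # Assumes a fixed bandwidth and ignores queueing, which is a simplification.
    return req.deadline_us - now_us - req.size_bytes / bytes_per_us


def least_slack_first(requests, now_us: float, bytes_per_us: float):
    # Order requests so the most urgent (smallest slack) is issued first.
    return sorted(requests, key=lambda r: slack_us(r, now_us, bytes_per_us))


if __name__ == "__main__":
    pending = [
        IORequest("a", size_bytes=4000, deadline_us=1000.0),
        IORequest("b", size_bytes=1000, deadline_us=500.0),
        IORequest("c", size_bytes=8000, deadline_us=900.0),
    ]
    for req in least_slack_first(pending, now_us=0.0, bytes_per_us=10.0):
        print(req.name, slack_us(req, 0.0, 10.0))
```

In this toy input, request `c` has a later deadline than `b` but is much larger, so it has less slack and is issued first. A real scheduler would also have to account for device queue depth and contention, which this sketch leaves out.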
Findings
Tutti reduces time to first token (TTFT) by 78.3% under strict service-level objectives (SLOs).
Tutti doubles the request rate compared to state-of-the-art solutions.
Tutti achieves near-DRAM-level inference performance while offering the far larger capacity of NVMe SSDs.
Abstract
LLM serving relies on prefix caching to improve inference performance. As growing contexts push key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, KV cache is increasingly offloaded to NVMe SSDs. Unfortunately, restoring KV cache from SSDs suffers from poor I/O performance and incurs significant GPU stalls. This is primarily because the fragmented GPU memory layout results in a massive number of tiny random I/Os, rendering the low-parallelism CPU a severe bottleneck even with GPUDirect Storage (GDS), which still relies on CPU intervention to initiate each I/O and thus remains CPU-centric. This paper presents Tutti, an efficient SSD-backed KV caching solution that eliminates CPU intervention from the critical data and I/O control paths between HBM and SSDs. At the core of Tutti is a GPU-centric KV cache object store, in which the CPU is only responsible for…
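To make the "massive number of tiny random I/Os" concrete, here is a rough back-of-the-envelope sketch. It assumes a paged KV cache in which the K and V tensors of each layer are stored as separate contiguous chunks per block, so restoring a prefix needs one read per (block, layer, K or V). That layout, the class `PagedKVLayout`, and the model dimensions are illustrative assumptions, not figures from the paper; real engines may pack chunks differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagedKVLayout:
    """Hypothetical paged KV cache layout; all values are illustrative."""
    num_layers: int       # transformer layers
    num_kv_heads: int     # KV heads per layer
    head_dim: int         # dimension per head
    block_tokens: int     # tokens per KV block (page)
    dtype_bytes: int = 2  # e.g. fp16 or bf16

    def chunk_bytes(self) -> int:
        # One contiguous chunk: K (or V) for one layer and one block.
        return (self.block_tokens * self.num_kv_heads
                * self.head_dim * self.dtype_bytes)

    def num_chunks(self, prefix_tokens: int) -> int:
        # Ceil-divide tokens into blocks, then count one K chunk and
        # one V chunk per layer per block.
        blocks = -(-prefix_tokens // self.block_tokens)
        return blocks * self.num_layers * 2


def restore_cost(layout: PagedKVLayout, prefix_tokens: int):
    """Return (number of reads, bytes per read, total bytes)."""
    n = layout.num_chunks(prefix_tokens)
    size = layout.chunk_bytes()
    return n, size, n * size


if __name__ == "__main__":
    layout = PagedKVLayout(num_layers=32, num_kv_heads=8,
                           head_dim=128, block_tokens=16)
    n, size, total = restore_cost(layout, prefix_tokens=32768)
    print(f"{n} reads of {size} bytes each, {total} bytes total")
```

Under these assumed dimensions, a 32,768-token prefix turns into 131,072 separate reads of 32 KiB each (4 GiB in total). If every one of those reads has to be initiated by a CPU thread, the per-I/O software overhead rather than SSD bandwidth can become the limiting factor, which is the bottleneck the abstract describes.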
