Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

Shi Qiu; Yifan Hu; Xintao Wang; Wenhao Zhu; Jianqin Yan; Hao Chen; Kaiqiang Xu; Kai Chen; Yiming Zhang

arXiv:2605.03375 · cs.OS · May 6, 2026

TL;DR

Tutti is a GPU-centric, SSD-backed key-value (KV) cache system that reduces GPU stalls and improves large language model serving performance by removing the CPU from the critical data and I/O control paths between HBM and SSDs.

Contribution

Tutti introduces a GPU-native object store, GPU io_uring support, and slack-aware I/O scheduling to optimize SSD-backed KV caching for LLM serving.

Findings

01. Tutti reduces TTFT by 78.3% under strict SLOs.

02. Tutti doubles the request rate compared to state-of-the-art solutions.

03. Tutti achieves near DRAM-level inference performance with almost infinite capacity.

Abstract

LLM serving relies on prefix caching to improve inference performance. As growing contexts push key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, KV cache is increasingly offloaded to NVMe SSDs. Unfortunately, restoring KV cache from SSDs suffers from poor I/O performance and incurs significant GPU stalls. This is primarily because the fragmented GPU memory layout results in a massive number of tiny random I/Os, rendering the low-parallelism CPU a severe bottleneck even with GPU Direct Storage (GDS), which still relies on CPU intervention to initiate each I/O and thus remains CPU-centric. This paper presents Tutti, an efficient SSD-backed KV caching solution that eliminates CPU intervention from the critical data and I/O control paths between HBM and SSDs. At the core of Tutti is a GPU-centric KV cache object store, in which the CPU is only responsible for…
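
To make the "massive number of tiny random I/Os" concrete, here is a back-of-the-envelope sketch in Python. Everything in it is an illustrative assumption rather than a detail from the paper: it assumes a paged KV cache in which each block stores K and V separately per layer, so that one contiguous chunk covers a single layer's K (or V) for a single block, and it plugs in hypothetical model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values, 16-token blocks). The `KVLayout` class and its methods are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayout:
    """Hypothetical paged KV cache layout (all fields are assumptions)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int
    block_tokens: int

    def chunk_bytes(self) -> int:
        # One contiguous chunk: K (or V) of one layer for one block of tokens.
        return (self.block_tokens * self.num_kv_heads
                * self.head_dim * self.dtype_bytes)

    def chunks_per_block(self) -> int:
        # K and V are assumed to live in separate regions for every layer.
        return 2 * self.num_layers

    def io_count(self, context_tokens: int) -> int:
        # Number of separate I/Os if each contiguous chunk is one request.
        blocks = -(-context_tokens // self.block_tokens)  # ceiling division
        return blocks * self.chunks_per_block()

    def total_bytes(self, context_tokens: int) -> int:
        # Bytes moved, counting whole blocks (the last block may be partial).
        return self.io_count(context_tokens) * self.chunk_bytes()


# Assumed example configuration, not taken from the paper.
layout = KVLayout(num_layers=32, num_kv_heads=8, head_dim=128,
                  dtype_bytes=2, block_tokens=16)
context = 32 * 1024  # a 32K-token prefix

print(f"chunk size : {layout.chunk_bytes() // 1024} KiB")
print(f"I/O count  : {layout.io_count(context):,}")
print(f"total size : {layout.total_bytes(context) / 2**30:.1f} GiB")
```

Under these assumptions, restoring a 32K-token prefix moves 4 GiB as 131,072 separate 32 KiB requests. If a CPU thread has to issue each one, request submission rather than SSD bandwidth can plausibly become the limiting factor, which is the bottleneck the abstract describes.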
