CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, Fan Lai

TL;DR
CacheFlow is a KV cache restoration framework that accelerates large language model serving through multi-dimensional parallelism and a batch-aware scheduler, reducing Time-To-First-Token (TTFT) across models and workloads.
Contribution
It introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs, plus a batch-aware scheduler, to optimize KV cache restoration. Together they address resource contention under batched serving and exploit parallelism that per-request approaches leave unused.
Findings
Reduces Time-To-First-Token (TTFT) by 10%-62% across models and workloads.
Effectively overlaps recomputation and I/O, improving serving efficiency.
Supports diverse models, workloads, and hardware configurations.
Abstract
KV cache restoration has emerged as a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing approaches treat restoration as a per-request tradeoff between recomputation and I/O transfer, either recomputing KV states from scratch or loading them from external storage (e.g., CPU memory or remote machines). However, these approaches fail to exploit parallelism across tokens, layers, and distributed deployments, and critically ignore resource contention under batched serving. We present CacheFlow, a KV cache restoration framework that rethinks cache restoration as a multi-dimensional parallel execution problem. CacheFlow introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs, enabling fine-grained overlap of recomputation and I/O along the structural dependencies of…
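To make the contrast between an either/or restoration choice and overlapped restoration concrete, the following is a minimal sketch along the token dimension only. It is not CacheFlow's scheduler. It assumes a toy cost model with constant per-token costs (real prefill cost grows with context length), and the function names and numbers are hypothetical. The GPU recomputes the first k tokens while the remaining tokens are loaded from external storage, so restoration finishes when the slower of the two finishes.

```python
# Toy cost model (assumption): each token costs a fixed number of
# microseconds to recompute on the GPU or to load from external storage.

def restore_time_exclusive(n_tokens: int, t_recompute: int, t_load: int) -> int:
    """Either/or baseline: recompute every token or load every token."""
    return min(n_tokens * t_recompute, n_tokens * t_load)


def restore_time_overlapped(n_tokens: int, t_recompute: int, t_load: int) -> tuple[int, int]:
    """Recompute the first k tokens while loading the other n - k in parallel.

    Returns (k, time) for the split that minimizes the finish time,
    found by brute force over all possible split points.
    """
    best_k, best_time = 0, n_tokens * t_load  # k = 0 means load everything
    for k in range(n_tokens + 1):
        # Both activities run concurrently; the slower one sets the finish time.
        time = max(k * t_recompute, (n_tokens - k) * t_load)
        if time < best_time:
            best_k, best_time = k, time
    return best_k, best_time


if __name__ == "__main__":
    n, tc, tl = 1000, 200, 300  # hypothetical: tokens, us/token recompute, us/token load
    print("exclusive :", restore_time_exclusive(n, tc, tl), "us")
    k, t = restore_time_overlapped(n, tc, tl)
    print(f"overlapped: recompute {k} tokens, load {n - k}, finish in {t} us")
```

Under these assumed costs the best split balances the two sides so that recomputation and loading finish at about the same time. The layer and GPU dimensions, and contention between batched requests, are outside the scope of this sketch.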
