ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management

Jing Zou; Shangyu Wu; Hancong Duan; Qiao Li; Chun Jason Xue

arXiv:2601.13631 · cs.OS · January 21, 2026


Open Access

TL;DR

ContiguousKV significantly accelerates LLM prefix cache offloading by aligning cache pruning with I/O granularity and prefetching, reducing bottlenecks and improving efficiency during the Re-Prefill phase.

Contribution

This work introduces ContiguousKV, a novel system that unifies cache management and I/O operations at a new granularity, enabling faster and more efficient LLM cache offloading.

Findings

01. Achieves a 3.85x speedup over IMPRESS in the Re-Prefill phase

02. Effectively eliminates read amplification and idle I/O bubbles

03. Maintains high output quality with improved efficiency

Abstract

Efficiently serving Large Language Models (LLMs) with persistent Prefix Key-Value (KV) Cache is critical for applications like conversational search and multi-turn dialogue. Serving a request requires loading the pre-computed prefix KV cache and generating the first token, defined as the Re-Prefill Phase. Offloading this shared prefix cache to secondary storage is essential for memory scalability. Re-Prefill with offloading suffers from severe I/O bottlenecks in two aspects. First, semantic-aware KV cache pruning algorithms select important tokens in fine granularity, while systems manage I/O in coarse, fixed-size blocks, causing severe read amplification. Second, the sequential dependency between identifying important tokens and loading KV cache creates idle I/O and compute bubbles, under-utilizing system resources. This paper proposes ContiguousKV, a high-performance prefix…
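The read amplification described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm; the function name `read_amplification`, the token indices, and the block sizes are hypothetical and chosen only to show how fine-grained token selection interacts with coarse, fixed-size I/O blocks.

```python
from typing import Iterable


def read_amplification(selected_tokens: Iterable[int], block_size: int) -> float:
    """Ratio of tokens read from storage to tokens actually needed,
    assuming every I/O fetches a whole block of `block_size` tokens."""
    selected = set(selected_tokens)
    if not selected:
        return 1.0  # nothing requested, nothing wasted
    # Each selected token forces a read of the entire block that holds it.
    blocks_touched = {t // block_size for t in selected}
    tokens_read = len(blocks_touched) * block_size
    return tokens_read / len(selected)


# Hypothetical selections of 4 "important" tokens from a 64-token prefix.
scattered = [3, 20, 37, 58]    # one token in each of four 16-token blocks
contiguous = [16, 17, 18, 19]  # four adjacent tokens

print(read_amplification(scattered, block_size=16))   # 16.0
print(read_amplification(contiguous, block_size=16))  # 4.0
print(read_amplification(contiguous, block_size=4))   # 1.0
```

Under these assumptions, scattered selections read 16 tokens for every token used, while a selection whose granularity matches the I/O block size reads nothing extra. That gap is the mismatch the abstract attributes to pruning at one granularity and loading at another.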


Taxonomy

Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Topic Modeling