ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management
Jing Zou, Shangyu Wu, Hancong Duan, Qiao Li, Chun Jason Xue

TL;DR
ContiguousKV accelerates the Re-Prefill phase of LLM serving when the prefix KV cache is offloaded to secondary storage. It aligns cache pruning with I/O granularity and prefetches the cache, which reduces I/O bottlenecks and improves efficiency.
Contribution
This work introduces ContiguousKV, a novel system that unifies cache management and I/O operations at a new granularity, enabling faster and more efficient LLM cache offloading.
Findings
Achieves a 3.85x speedup over IMPRESS in the Re-Prefill phase
Effectively eliminates read amplification and idle I/O bubbles
Maintains high output quality with improved efficiency
Abstract
Efficiently serving Large Language Models (LLMs) with a persistent prefix Key-Value (KV) cache is critical for applications like conversational search and multi-turn dialogue. Serving a request requires loading the pre-computed prefix KV cache and generating the first token, which is defined as the Re-Prefill phase. Offloading this shared prefix cache to secondary storage is essential for memory scalability. However, Re-Prefill with offloading suffers from severe I/O bottlenecks in two respects. First, semantic-aware KV cache pruning algorithms select important tokens at fine granularity, while systems manage I/O in coarse, fixed-size blocks, causing severe read amplification. Second, the sequential dependency between identifying important tokens and loading the KV cache creates idle I/O and compute bubbles, under-utilizing system resources. This paper proposes ContiguousKV, a high-performance prefix…
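The two bottlenecks named in the abstract can be illustrated with a small toy model. The sketch below is not taken from the paper: the function names, the token indices, the block sizes, and the per-step timings are all hypothetical, and the overlap model is a generic two-stage pipeline rather than ContiguousKV's actual mechanism. It only shows why scattered token selection inflates block reads, and why serializing identification and loading leaves resources idle.

```python
def read_amplification(selected_tokens, block_size):
    """Tokens read from storage divided by tokens actually needed,
    assuming I/O is issued in fixed-size blocks of `block_size` tokens.
    (Hypothetical helper for illustration only.)"""
    needed = set(selected_tokens)
    if not needed:
        return 0.0
    # Every block that contains at least one selected token is read in full.
    blocks_touched = {t // block_size for t in needed}
    return len(blocks_touched) * block_size / len(needed)


def serial_time(identify, load):
    """Strictly sequential: each load waits for its identification step,
    and the next identification waits for that load to finish."""
    return sum(identify) + sum(load)


def overlapped_time(identify, load):
    """Generic two-stage pipeline (an assumption, not the paper's design):
    identification for step i+1 may run while step i is still loading."""
    compute_done = 0.0
    io_done = 0.0
    for c, l in zip(identify, load):
        compute_done += c
        # A load starts once its selection is known and the I/O channel is free.
        io_done = max(io_done, compute_done) + l
    return io_done


# Eight important tokens, each in a different 16-token block.
scattered = [0, 16, 32, 48, 64, 80, 96, 112]
# Eight important tokens that sit next to each other.
contiguous = list(range(8))

amp_scattered = read_amplification(scattered, 16)      # 8 blocks * 16 / 8 tokens
amp_contiguous = read_amplification(contiguous, 16)    # 1 block * 16 / 8 tokens
amp_aligned = read_amplification(contiguous, 8)        # 1 block * 8 / 8 tokens

# Hypothetical per-step costs in arbitrary time units.
identify = [1, 1, 1, 1]
load = [2, 2, 2, 2]
t_serial = serial_time(identify, load)
t_overlap = overlapped_time(identify, load)
```

In this toy setting, read amplification falls from 16 to 1 when the selection unit and the I/O unit coincide, and overlapping the two stages shortens the four-step run from 12 to 9 time units. These numbers follow from the made-up inputs and say nothing about the paper's measured results.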
Taxonomy
Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Topic Modeling
