PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving
Wenfeng Wang, Xiaofeng Hou, Peng Tang, Hengyi Zhou, Jing Wang, Xinkai Wang, Chao Li, Minyi Guo

TL;DR
PCR is a system that enhances cache reuse in RAG serving by using intelligent prefetching and pipelining, significantly reducing latency and improving throughput in large language model applications.
Contribution
PCR combines three techniques to maximize KV-cache reuse efficiency in RAG systems: prefix-tree caching, layer-wise overlapping, and queue-based prefetching.
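To make the layer-wise overlapping idea concrete, here is a minimal latency model. It assumes that "layer-wise overlapping" means loading the cached KV states for one layer while the previous layer is still computing. The summary does not spell this out, so treat it as an interpretation rather than the authors' design. The function names and the timings are hypothetical and are not taken from the paper.

```python
def sequential_latency(num_layers: int, load_ms: int, compute_ms: int) -> int:
    """Each layer first loads its cached KV states, then computes; nothing overlaps."""
    return num_layers * (load_ms + compute_ms)


def overlapped_latency(num_layers: int, load_ms: int, compute_ms: int) -> int:
    """Two-stage pipeline: the load for layer i+1 runs while layer i computes.

    The first load cannot be hidden, the last compute cannot be hidden,
    and every step in between is bounded by the slower of the two stages.
    Assumes loads may run ahead of compute without a buffer limit.
    """
    return load_ms + (num_layers - 1) * max(load_ms, compute_ms) + compute_ms


# Illustrative numbers only (not measurements from the paper).
layers, load, compute = 4, 2, 3
baseline = sequential_latency(layers, load, compute)
pipelined = overlapped_latency(layers, load, compute)
print(f"sequential: {baseline} ms, overlapped: {pipelined} ms")
```

Under this model the benefit is largest when load and compute times are similar, since the faster stage is then almost entirely hidden behind the slower one.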
Findings
Achieves up to a 2.47x speedup in time to first token (TTFT).
Outperforms existing KV-cache reuse methods.
Effectively reduces latency in high-throughput RAG serving.
Abstract
Retrieval-Augmented Generation (RAG) systems enhance the performance of large language models (LLMs) by incorporating supplementary retrieved documents, enabling more accurate and context-aware responses. However, integrating these external documents often results in very long input sequences, which significantly increases computation costs during the prefill stage, where key-value (KV) representations for all input tokens are generated. This latency bottleneck becomes especially pronounced under high-throughput serving scenarios. KV-cache reuse offers a promising solution by storing previously computed KV states for shared input prefixes, thereby avoiding redundant computation across requests that contain overlapping context. Yet, the effectiveness of cache reuse is often limited by three practical challenges: low cache hit rates due to naive eviction policies, high CPU-GPU data…
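The prefix-reuse idea described in the abstract can be sketched with a small prefix tree keyed on context chunks. This is a minimal illustration, not PCR's implementation: the class and method names are hypothetical, the cached KV states are stood in for by strings, and eviction, CPU-GPU transfer, and prefetching are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    # Placeholder for the KV tensors of the prefix that ends at this node.
    kv: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


class PrefixKVCache:
    """Hypothetical prefix tree: each edge is one context chunk (e.g. a document)."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, chunks: List[str]) -> None:
        """Record KV states for every prefix of the given chunk sequence."""
        node = self.root
        prefix: List[str] = []
        for chunk in chunks:
            prefix.append(chunk)
            node = node.children.setdefault(chunk, Node())
            if node.kv is None:
                # A real system would store tensors; a string stands in here.
                node.kv = "kv(" + "|".join(prefix) + ")"

    def longest_prefix(self, chunks: List[str]) -> Tuple[int, Optional[str]]:
        """Return how many leading chunks are cached, and the matching KV entry."""
        node, matched, kv = self.root, 0, None
        for chunk in chunks:
            child = node.children.get(chunk)
            if child is None or child.kv is None:
                break
            node, matched, kv = child, matched + 1, child.kv
        return matched, kv


cache = PrefixKVCache()
cache.insert(["system", "doc_a", "doc_b"])

# A new request shares only the first two chunks with the cached one.
hit, kv = cache.longest_prefix(["system", "doc_a", "doc_c", "question"])
print(hit, kv)
```

Only the chunks after the matched prefix need to go through prefill. Note that a request whose first chunk differs gets no reuse at all, because KV states depend on everything that precedes them.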
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
