Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Ruoyu Qin; Weiran He; Yaoyu Wang; Zheming Li; Xinran Xu; Yongwei Wu; Weimin Zheng; Mingxing Zhang

arXiv:2604.15039·cs.DC·April 23, 2026

TL;DR

This paper introduces Prefill-as-a-Service (PrfaaS), a system architecture that enables efficient cross-datacenter serving of large language models by offloading prefill tasks to dedicated clusters, raising throughput and lowering latency.

Contribution

PrfaaS combines model and system optimizations to enable scalable, bandwidth-aware cross-datacenter KVCache serving for large models, overcoming practical deployment challenges.

Findings

01

Achieves 54% higher throughput than the baseline.

02

Reduces P90 TTFT by 64%.

03

Gains 15% throughput at similar cost.

Abstract

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates a large volume of KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, a smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable…
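
The KVCache-size argument in the abstract can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only and is not taken from the paper: the layer counts, head sizes, prompt length, and link speeds are hypothetical, and it assumes that a hybrid-attention model keeps a per-token KVCache in only a fraction of its layers (any fixed-size state in the remaining layers is ignored). The function names `kv_bytes_per_token` and `transfer_seconds` are made up for this example.

```python
"""Back-of-envelope KVCache transfer estimate (all model/link numbers are hypothetical)."""


def kv_bytes_per_token(kv_layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KVCache one token adds across layers that keep a per-token cache.

    The leading factor 2 counts one key and one value vector per KV head.
    """
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_tokens: int, bytes_per_token: int, link_gbps: float) -> float:
    """Ideal time to ship a prefill KVCache over a link (no protocol overhead)."""
    total_bits = num_tokens * bytes_per_token * 8
    return total_bits / (link_gbps * 1e9)


# Hypothetical dense model: all 64 layers keep a per-token KVCache.
DENSE = kv_bytes_per_token(kv_layers=64, kv_heads=8, head_dim=128)
# Hypothetical hybrid model: assume only 16 of 64 layers keep one.
HYBRID = kv_bytes_per_token(kv_layers=16, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    prompt_tokens = 32_768  # assumed long-context request
    for name, per_token in (("dense", DENSE), ("hybrid", HYBRID)):
        for gbps in (400.0, 100.0):  # assumed intra- vs inter-cluster link speeds
            secs = transfer_seconds(prompt_tokens, per_token, gbps)
            print(f"{name:6s} {per_token} B/token @ {gbps:.0f} Gbps: {secs:.3f} s")
```

Under these assumptions the per-token cache shrinks by 4x, and the ideal transfer time shrinks by the same factor. The estimate deliberately leaves out the effects the abstract lists as remaining obstacles: bursty arrivals, skewed request lengths, uneven prefix-cache hits, and fluctuating inter-cluster bandwidth.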
