Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang

TL;DR
This paper introduces Prefill-as-a-Service (PrfaaS), a system architecture for efficient cross-datacenter serving of large language models: prefill work is offloaded to dedicated clusters, which raises throughput and lowers latency.
Contribution
PrfaaS combines model and system optimizations to enable scalable, bandwidth-aware cross-datacenter KVCache serving for large models, overcoming practical deployment challenges.
Findings
Achieves 54% higher throughput compared to baseline.
Reduces P90 TTFT by 64%.
Gains 15% throughput at similar cost.
Abstract
Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates heavy KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, a smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable…
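To make the bandwidth argument in the abstract concrete, the following is a rough back-of-envelope sketch, not the paper's method. It uses the standard KVCache size formula for attention layers (keys plus values, per layer, per KV head). The model shape, the 1-in-4 full-attention ratio for the hybrid case, and the link speeds are all hypothetical values chosen for illustration; the small fixed-size state of non-attention layers is ignored.

```python
def kvcache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of keys and values cached for one request.

    n_layers counts only layers that keep a full, length-proportional KVCache.
    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes


def transfer_seconds(num_bytes, link_gbps):
    """Idealized transfer time over a link of link_gbps gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical dense model: 64 full-attention layers, 8 KV heads, head_dim 128,
# 16-bit cache entries, one 32k-token request.
dense = kvcache_bytes(seq_len=32_768, n_layers=64, n_kv_heads=8, head_dim=128)

# Hypothetical hybrid model: assume only 1 in 4 layers keeps a full KVCache.
hybrid = kvcache_bytes(seq_len=32_768, n_layers=16, n_kv_heads=8, head_dim=128)

for name, size in [("dense", dense), ("hybrid", hybrid)]:
    for gbps in (400, 100, 10):  # assumed intra-cluster vs. cross-datacenter links
        t = transfer_seconds(size, gbps)
        print(f"{name:6s} {size / 2**30:5.1f} GiB over {gbps:3d} Gbps: {t:6.2f} s")
```

Under these assumed numbers, the dense cache for a single request is 8 GiB and the hybrid cache is 2 GiB. The smaller cache is much easier to move over a slow link, but a burst of long requests can still saturate it, which matches the abstract's point that a smaller KVCache alone is not sufficient.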
