PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
Xingyu Qu, Tianhao Lin, Yiqi Li, Zhiyu Chen, Sheng Wang

TL;DR
PRISM co-designs request scheduling and KV-cache management for online large language model (LLM) serving. By exploiting prompt segmentation and hotspot skew, it reduces average per-QPS P99 time to first token (TTFT) by up to 37.1% and raises the exact-prefix KV-cache hit rate by up to 12.2 percentage points.
Contribution
PRISM jointly designs scheduling and KV-cache management so that request admission is aligned with KV-cache retention. Existing work handles the two independently and does not provide this coupling.
Findings
Reduces average per-QPS P99 TTFT by up to 37.1%.
Increases exact-prefix KV-cache hit rate by up to 12.2 percentage points.
Achieves significant latency reduction on large models.
Abstract
Modern online large language model (LLM) services, such as Retrieval-Augmented Generation (RAG) and agent systems, increasingly expose two prominent characteristics: prompt segmentation (e.g., system instructions, retrieved passages, tool outputs) and hotspot skew, where a small set of these segments recurs frequently across user requests. Failing to jointly exploit these patterns could lead to repeated prefill of hot segments and prolonged TTFT, undermining both throughput and user-perceived responsiveness. However, existing work tackles these patterns independently: KV-cache management mainly exploits segment reuse while scheduling reorders requests to improve cache locality, yet neither aligns request admission with KV-cache retention. To address this gap, we first analyze how scheduling and KV-cache management jointly affect TTFT. Guided by this, we present PRISM (Prefix Reuse…
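To make the interaction between request order and cache retention concrete, here is a minimal Python sketch. It is not PRISM's algorithm. It assumes a toy exact-prefix cache keyed by tuples of prompt segments, a capacity counted in cached prefixes rather than bytes, and plain LRU eviction. The names `PrefixCache` and `segment_hit_rate` and the sample prompts are hypothetical and exist only for illustration.

```python
from collections import OrderedDict
from typing import List, Sequence, Tuple

Segment = str
Prompt = Sequence[Segment]


class PrefixCache:
    """Toy exact-prefix cache (an assumption, not PRISM's design).

    Each key is a tuple of leading segments; its presence stands for
    "the KV state of this prefix is still resident". Eviction is LRU.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # max number of cached prefixes
        self._entries: "OrderedDict[Tuple[Segment, ...], None]" = OrderedDict()

    def longest_cached_prefix(self, prompt: Prompt) -> int:
        """Return how many leading segments can skip prefill."""
        hit = 0
        for end in range(1, len(prompt) + 1):
            if tuple(prompt[:end]) not in self._entries:
                break
            hit = end
        return hit

    def insert(self, prompt: Prompt) -> None:
        """Cache every prefix of the prompt, then evict down to capacity."""
        # Touch longer prefixes first so shorter (more shareable) ones
        # end up most recently used and are evicted last.
        for end in range(len(prompt), 0, -1):
            key = tuple(prompt[:end])
            self._entries[key] = None
            self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used


def segment_hit_rate(prompts: List[Prompt], capacity: int) -> float:
    """Fraction of segments served from cache when prompts run in order."""
    cache = PrefixCache(capacity)
    hits = total = 0
    for prompt in prompts:
        hits += cache.longest_cached_prefix(prompt)
        total += len(prompt)
        cache.insert(prompt)
    return hits / total


# Hypothetical segmented prompts: system instruction, retrieved doc, query.
p_a1 = ["sys", "docA", "q1"]
p_a2 = ["sys", "docA", "q2"]
p_b = ["sys", "docB", "q3"]

interleaved = [p_a1, p_b, p_a2]  # hot prefix evicted before it is reused
grouped = [p_a1, p_a2, p_b]      # requests sharing a prefix run back to back

if __name__ == "__main__":
    print("interleaved:", segment_hit_rate(interleaved, capacity=3))
    print("grouped:    ", segment_hit_rate(grouped, capacity=3))
```

With a cache that holds only three prefixes, the grouped order reuses the `("sys", "docA")` prefix before it is evicted, while the interleaved order loses it to the intervening `docB` request. This is the kind of coupling between admission order and retention that the abstract describes, shown here under simplified assumptions.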
