Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
Zongze Li, Jingyu Liu, Zhen Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang

TL;DR
This paper introduces PPD disaggregation, a dynamic routing system for multi-turn LLM inference. It runs append-prefill for later turns locally on decode nodes when that is beneficial, which reduces latency and the KV-transfer bandwidth consumed between prefill and decode nodes and improves performance under high load.
Contribution
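The paper proposes PPD disaggregation, an adaptive routing approach for multi-turn LLM serving. It builds on the observation that append-prefill, which processes only the new input tokens while reusing cached KV states, slows concurrent decoding an order of magnitude less than full prefill, so it can be routed to decode nodes to relieve latency and bandwidth bottlenecks.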
Findings
PPD reduces time-to-first-token (TTFT) for Turn 2 and later by approximately 68%.
PPD maintains competitive time-per-output-token (TPOT) while improving latency.
Dynamic routing adapts effectively to varying service-level objectives (SLOs).
Abstract
Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines because it alleviates interference between two distinct workloads. With the growing demand for multi-turn interactions in chatbots and agentic systems, we re-examined PD in this setting and found two fundamental inefficiencies: (1) every turn requires prefilling the new prompt and response from the last turn, and (2) repeated KV transfers between prefill and decode nodes saturate the bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill, which processes only the new input tokens while reusing cached KV states, incurs an order-of-magnitude smaller decoding slowdown than full prefill. This motivates routing append-prefill to decode nodes locally. However, through comprehensive analysis, we…
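The routing idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the Turn record, the prefill_cost_tokens and route helpers, and the busy_threshold fallback are all hypothetical names and assumptions introduced here, and the token accounting is simplified to "full history plus new prompt" versus "new prompt only". It only shows why append-prefill is cheaper and how a router might keep later turns on the decode node unless that node is already heavily loaded.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    conversation_id: str
    turn_index: int      # 1-based position of this turn in the conversation
    new_tokens: int      # tokens in the new user prompt
    cached_tokens: int   # history tokens whose KV states already sit on the decode node


def prefill_cost_tokens(turn: Turn, reuse_cache: bool) -> int:
    """Tokens that must be pushed through prefill for this turn.

    Simplified model: full prefill recomputes history plus the new prompt,
    append-prefill reuses the cached KV states and processes only new tokens.
    """
    if reuse_cache:
        return turn.new_tokens
    return turn.cached_tokens + turn.new_tokens


def route(turn: Turn, decode_busy_fraction: float, busy_threshold: float = 0.8) -> str:
    """Pick a target node type: 'prefill' (dedicated node) or 'decode' (local append-prefill).

    busy_threshold is an assumed tuning knob, not a value from the paper.
    """
    # Turn 1 has no cached history, so it needs a full prefill.
    if turn.turn_index == 1 or turn.cached_tokens == 0:
        return "prefill"
    # Assumed fallback: do not add work to a decode node that is already saturated.
    if decode_busy_fraction >= busy_threshold:
        return "prefill"
    # Otherwise run the cheap append-prefill where the KV cache already lives,
    # avoiding another KV transfer between nodes.
    return "decode"


if __name__ == "__main__":
    first = Turn("chat-1", turn_index=1, new_tokens=200, cached_tokens=0)
    later = Turn("chat-1", turn_index=2, new_tokens=50, cached_tokens=700)
    for turn in (first, later):
        target = route(turn, decode_busy_fraction=0.3)
        cost = prefill_cost_tokens(turn, reuse_cache=(target == "decode"))
        print(turn.turn_index, target, cost)
```

In this toy setup the second turn stays on the decode node and processes 50 tokens instead of 750, which is the kind of gap that makes local append-prefill attractive. A real router would also have to weigh the SLO-related factors the paper analyzes, which this sketch does not model.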
