Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
Xiaoxiang Shi, Colin Cai, Junjia Du, Zhihao Jia

TL;DR
Nexus is a proactive intra-GPU disaggregation approach for LLM serving that dynamically partitions GPU resources between the prefill and decode phases, reporting up to 2.2x higher throughput and lower TTFT and TBT than vLLM.
Contribution
The paper presents a proactive resource partitioning system that adapts to workload dynamics, addressing the reactive behavior of prior intra-GPU disaggregation techniques.
Findings
Up to 2.2x higher throughput compared to vLLM
20x lower time to first token (TTFT) than vLLM
2.5x lower time between tokens (TBT) than vLLM
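The two latency metrics above can be made concrete with a small sketch. TTFT is the delay from a request's arrival to its first output token, and TBT is the gap between consecutive output tokens. The function names and the timestamps below are hypothetical and chosen only for illustration; they are not taken from the paper or from vLLM.

```python
from typing import List, Tuple


def ttft_and_tbt(arrival: float, token_times: List[float]) -> Tuple[float, List[float]]:
    """Return (TTFT, per-token TBT gaps) for a single request.

    arrival      -- time the request entered the system (seconds)
    token_times  -- time each output token was emitted, in order (seconds)
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")
    # TTFT: wait until the first token appears (dominated by prefill).
    ttft = token_times[0] - arrival
    # TBT: gap between each pair of consecutive tokens (decode steps).
    tbt = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, tbt


def times_lower(baseline: float, candidate: float) -> float:
    """How many times lower the candidate latency is than the baseline."""
    return baseline / candidate


# Hypothetical request: arrives at t=0.0 and emits three tokens.
ttft, tbt = ttft_and_tbt(0.0, [0.5, 0.75, 1.0])
```

A statement such as "20x lower TTFT" then corresponds to `times_lower(baseline_ttft, nexus_ttft)` evaluating to 20 for the measured values, whatever aggregation (mean or a tail percentile) the evaluation uses.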
Abstract
Monolithic serving with chunked prefill improves GPU utilization by batching prefill and decode together, but suffers from fine-grained phase interference. Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Prior intra-GPU disaggregation approaches multiplex prefill and decode within a single GPU, using SLO-based tuning guided by heuristics from offline profiling or reactive feedback loops. However, these methods respond reactively to performance issues rather than anticipating them, limiting adaptability under dynamic workloads. We ask: can we achieve proactive intra-GPU disaggregation that adapts effectively to dynamic workloads? The key challenge lies in managing the conflicting resource demands of prefill and decode under varying conditions. We first show that GPU resources exhibit diminishing returns -- beyond…
