Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving

Xiaoxiang Shi; Colin Cai; Junjia Du; Zhihao Jia

arXiv:2507.06608 · cs.DC · August 8, 2025

TL;DR

Nexus is a proactive intra-GPU disaggregation approach for LLM serving that partitions a single GPU's resources between the prefill and decode phases as the workload changes, improving throughput and latency over existing systems such as vLLM.

Contribution

The paper presents a proactive resource-partitioning system that adapts to workload dynamics, addressing the reactive behavior of prior intra-GPU disaggregation techniques.

Findings

01  Up to 2.2x higher throughput than vLLM

02  20x lower time-to-first-token (TTFT) than vLLM

03  2.5x lower time-between-tokens (TBT) than vLLM
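
To make the latency figures concrete, here is a minimal Python sketch of how TTFT (time to first token) and TBT (time between tokens) are commonly derived from per-token timestamps. It uses the usual definitions of these metrics. The page does not describe the paper's measurement harness, so the function names (`ttft`, `tbt`, `speedup`) and the timestamps are hypothetical and purely illustrative.

```python
from statistics import mean


def ttft(arrival_time, token_times):
    """Time to first token: delay from request arrival to the first output token."""
    return token_times[0] - arrival_time


def tbt(token_times):
    """Time between tokens: gaps between consecutive output tokens."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def speedup(baseline, improved):
    """Ratio by which a latency dropped, e.g. 20.0 means '20x lower'."""
    return baseline / improved


# Hypothetical request: arrives at t=0.0 s and emits four tokens (illustrative values).
arrival = 0.0
tokens = [0.5, 0.75, 1.0, 1.5]

first_token_latency = ttft(arrival, tokens)  # delay before the first token appears
gaps = tbt(tokens)                           # one gap per consecutive token pair
mean_gap = mean(gaps)                        # a single-number TBT summary

print(f"TTFT = {first_token_latency:.3f} s, mean TBT = {mean_gap:.3f} s")
```

A claim such as "20x lower TTFT" then corresponds to `speedup(baseline_ttft, new_ttft)` evaluating to 20. Whether the paper reports means or tail percentiles is not stated on this page.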

Abstract

Monolithic serving with chunked prefill improves GPU utilization by batching prefill and decode together, but suffers from fine-grained phase interference. Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Prior intra-GPU disaggregation approaches multiplex prefill and decode within a single GPU, using SLO-based tuning guided by heuristics from offline profiling or reactive feedback loops. However, these methods respond reactively to performance issues rather than anticipating them, limiting adaptability under dynamic workloads. We ask: can we achieve proactive intra-GPU disaggregation that adapts effectively to dynamic workloads? The key challenge lies in managing the conflicting resource demands of prefill and decode under varying conditions. We first show that GPU resources exhibit diminishing returns -- beyond…
