DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

Lei Gao; Chaoyi Jiang; Hossein Entezari Zarch; Daniel Wong; Murali Annavaram

arXiv:2511.04791·cs.LG·November 10, 2025

TL;DR

DuetServe is a GPU-based LLM serving framework that runs prefill and decode together on one GPU by default and adaptively partitions the GPU's SMs between the two phases when decode latency is predicted to degrade, raising throughput while keeping latency within SLOs.

Contribution

DuetServe introduces dynamic, fine-grained SM multiplexing that isolates the prefill and decode phases only when contention is predicted to degrade time-between-tokens (TBT), avoiding the duplicated models and KV cache transfers that disaggregation across GPUs requires.

Findings

01. Up to 1.3x throughput improvement

02. Meets low-latency SLOs while adaptive multiplexing is active

03. Balances GPU resource utilization against inference performance

Abstract

Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades time-between-tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase…
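
The mode switch described in the abstract (aggregated by default, SM partitioning only when TBT degradation is predicted) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear latency model and its coefficients, the names `Batch`, `Plan`, `predict_tbt_ms`, and `choose_mode`, the SM count of 132, and the fixed 25% decode share. The paper's actual predictor and partitioning policy are not shown in this excerpt.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    prefill_tokens: int   # prompt tokens scheduled in this iteration
    decode_requests: int  # requests that each generate one token


@dataclass
class Plan:
    mode: str         # "aggregated" or "multiplexed"
    prefill_sms: int  # SMs available to prefill
    decode_sms: int   # SMs available to decode


def predict_tbt_ms(batch, base_ms=5.0, prefill_ms_per_token=0.05,
                   decode_ms_per_request=0.1):
    """Toy linear estimate of one aggregated iteration's latency.

    The coefficients are made-up placeholders, not measured values.
    """
    return (base_ms
            + prefill_ms_per_token * batch.prefill_tokens
            + decode_ms_per_request * batch.decode_requests)


def choose_mode(batch, tbt_slo_ms, total_sms=132, decode_sm_fraction=0.25):
    """Stay aggregated unless the predicted TBT would exceed the SLO."""
    mixed = batch.prefill_tokens > 0 and batch.decode_requests > 0
    if not mixed or predict_tbt_ms(batch) <= tbt_slo_ms:
        # Aggregated mode: both phases share every SM.
        return Plan("aggregated", total_sms, total_sms)
    # Multiplexed mode: give each phase a disjoint set of SMs.
    decode_sms = max(1, round(total_sms * decode_sm_fraction))
    return Plan("multiplexed", total_sms - decode_sms, decode_sms)


light = choose_mode(Batch(prefill_tokens=256, decode_requests=32), tbt_slo_ms=50.0)
heavy = choose_mode(Batch(prefill_tokens=4096, decode_requests=32), tbt_slo_ms=50.0)
print(light.mode, heavy.mode, heavy.prefill_sms, heavy.decode_sms)
```

In this sketch a short prompt batch keeps the GPU in aggregated mode, while a long prompt that would push the estimated iteration time past the SLO triggers a split. A real system would also have to choose the split ratio adaptively rather than use a fixed fraction.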

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management