Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving

Chao Wang; Pengfei Zuo; Zhangyu Chen; Yunkai Liang; Zhou Yu; Ming-Chang Yang

arXiv:2508.01989 · cs.DC · August 5, 2025

TL;DR

This paper introduces TaiChi, a unified LLM serving system that optimizes goodput by adaptively combining prefill-decode aggregation and disaggregation, using latency shifting and specialized scheduling to meet diverse service-level objectives.

Contribution

TaiChi unifies aggregation and disaggregation for LLM serving, enabling optimal goodput across various SLOs through adaptive resource allocation and latency-aware scheduling.

Findings

01. TaiChi improves goodput by up to 77% over existing systems.

02. TaiChi adapts to different SLO regimes through configurable sliders.

03. Hybrid mode with latency shifting maximizes the number of requests that meet their SLOs when TTFT and TPOT SLOs are balanced.

Abstract

An ongoing debate considers whether prefill-decode (PD) aggregation or disaggregation is superior for serving large language models (LLMs). This has driven optimizations for both approaches, each showing distinct advantages. This paper compares PD aggregation and disaggregation, showing that each excels under different service-level objectives (SLOs): aggregation is optimal for tight time-to-first-token (TTFT) and relaxed time-per-output-token (TPOT), while disaggregation excels for strict TPOT and relaxed TTFT. However, under balanced TTFT and TPOT SLOs, neither approach delivers optimal goodput. This paper proposes TaiChi, an LLM serving system that unifies PD disaggregation and aggregation for optimal goodput under any combination of TTFT and TPOT SLOs. TaiChi uses a unified disaggregation-aggregation architecture with differentiated-capability GPU instances: prefill-heavy (fast…
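To make the three SLO regimes in the abstract concrete, the following is a minimal sketch of how SLO attainment might be computed, assuming a request counts toward goodput only when it meets both its TTFT and TPOT SLOs. The `Request` type, the `slo_attainment` helper, and every latency number below are hypothetical illustrations, not measurements or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ttft_ms: float  # observed time-to-first-token, in milliseconds
    tpot_ms: float  # observed mean time-per-output-token, in milliseconds


def slo_attainment(requests, ttft_slo_ms, tpot_slo_ms):
    """Return the fraction of requests that meet BOTH the TTFT and TPOT SLOs.

    Assumption: a request is "good" only if both SLOs hold for it.
    """
    if not requests:
        return 0.0
    good = sum(
        1
        for r in requests
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return good / len(requests)


# Hypothetical latency profiles (made-up numbers, for illustration only):
# aggregation gives low TTFT but higher TPOT; disaggregation gives the reverse.
aggregated = [Request(200, 90), Request(250, 110), Request(300, 95), Request(220, 120)]
disaggregated = [Request(600, 40), Request(900, 45), Request(500, 38), Request(1200, 42)]

# Regime 1: tight TTFT, relaxed TPOT.
tight_ttft = (slo_attainment(aggregated, 400, 150), slo_attainment(disaggregated, 400, 150))

# Regime 2: strict TPOT, relaxed TTFT.
strict_tpot = (slo_attainment(aggregated, 1500, 50), slo_attainment(disaggregated, 1500, 50))

# Regime 3: balanced SLOs, where neither profile satisfies every request.
balanced = (slo_attainment(aggregated, 550, 100), slo_attainment(disaggregated, 550, 100))

print(tight_ttft, strict_tpot, balanced)
```

With these made-up profiles, the aggregated set satisfies every request only in the first regime and the disaggregated set only in the second, while both fall short under the balanced SLOs. That third case is the gap the abstract says TaiChi targets.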

Taxonomy

Topics: Software System Performance and Reliability · Big Data and Digital Economy · Scientific Computing and Data Management