Efficient Multi-round LLM Inference over Disaggregated Serving

Wenhao He; Youhe Jiang; Penghao Zhao; Quanqing Xu; Eiko Yoneki; Bin Cui; Fangcheng Fu

arXiv:2602.14516·cs.DC·February 17, 2026

TL;DR

This paper introduces AMPD, a disaggregated serving framework for multi-round LLM inference. AMPD adaptively decides where incremental prefill workloads run and how they are scheduled based on real-time load, with the goal of improving SLO attainment.

Contribution

AMPD is the first framework to explicitly handle interleaved prefill-decode workloads in multi-round LLM inference, improving SLO attainment through adaptive scheduling and resource planning.

Findings

01 Substantial improvement in SLO attainment over baselines
02 Effective coordination of prefill workloads in multi-round inference
03 Adaptive scheduling enhances resource utilization

Abstract

With the rapid evolution of Large Language Models (LLMs), multi-round workflows, such as autonomous agents and iterative retrieval, have become increasingly prevalent. This trend complicates serving LLMs under prefill-decode (PD) disaggregation, a widely adopted paradigm that places the compute-bound prefill phase and the memory-bound decode phase on separate resources. Specifically, existing systems overlook the interleaved prefill-decode workload pattern of multi-round inference, which leads to sub-optimal handling of the incremental prefill workloads and sub-optimal model deployment for the two phases. In this work, we present AMPD, a new disaggregated serving framework for multi-round LLM inference. The core of AMPD is to coordinate the prefill workloads based on real-time load by adaptively determining where to carry out these workloads and how they are scheduled, in…
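
To make the "where to carry out these workloads" decision concrete, the following is a minimal Python sketch of one possible placement rule for an incremental prefill in a later round. It is not AMPD's algorithm, which the abstract does not specify. The cost model is an assumption made for illustration: running on the decode worker reuses the KV cache already resident there, while running on the prefill worker pays a per-token KV cache transfer cost. The names `WorkerLoad`, `estimated_wait_s`, `place_incremental_prefill`, and `kv_transfer_s_per_token` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """Snapshot of one worker's load (hypothetical fields, for illustration only)."""
    queued_prefill_tokens: int   # tokens already waiting to be prefilled on this worker
    prefill_tokens_per_s: float  # assumed prefill throughput of this worker


def estimated_wait_s(load: WorkerLoad, new_tokens: int) -> float:
    """Estimated seconds until `new_tokens` finish prefill behind the current queue."""
    return (load.queued_prefill_tokens + new_tokens) / load.prefill_tokens_per_s


def place_incremental_prefill(
    new_tokens: int,
    history_tokens: int,
    prefill_worker: WorkerLoad,
    decode_worker: WorkerLoad,
    kv_transfer_s_per_token: float,
) -> str:
    """Return "decode" or "prefill": where the incremental prefill is estimated to finish sooner.

    Assumed cost model (not taken from the paper):
      - local:  prefill runs on the decode worker, which already holds the KV cache.
      - remote: prefill runs on the prefill worker, plus a transfer cost proportional
                to the history and the newly produced KV cache entries.
    """
    local_s = estimated_wait_s(decode_worker, new_tokens)
    remote_s = (
        estimated_wait_s(prefill_worker, new_tokens)
        + (history_tokens + new_tokens) * kv_transfer_s_per_token
    )
    return "decode" if local_s <= remote_s else "prefill"
```

Under this toy model, a short increment on top of a long history tends to stay on the decode worker because the transfer cost dominates, while a long increment tends to move to the faster prefill worker unless its queue is heavily backed up. A real system would also need to account for the interference that local prefill causes to ongoing decode steps, which this sketch ignores.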

Taxonomy

Topics: Big Data and Digital Economy · Software System Performance and Reliability · Explainable Artificial Intelligence (XAI)