Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture

Yu Wu; Tongxuan Liu; Yuting Zeng; Siyu Wu; Jun Xiong; Xianzhe Dong; Hailong Yang; Ke Zhang; Jing Li

arXiv:2505.11916·cs.DC·November 7, 2025

TL;DR

Arrow is an adaptive scheduling system for disaggregated LLM inference that dynamically rebalances instances between prefill and decode work, raising request serving rates by up to 2.55x under variable traffic.

Contribution

It introduces an adaptive scheduler that exploits stateless instances and the latency characteristics of prefill and decode tasks to schedule both requests and instances, improving resource utilization in disaggregated LLM serving architectures.

Findings

01 Achieves up to 2.55x higher request serving rates

02 Effectively handles traffic spikes and load variations

03 Improves resource utilization in LLM inference systems

Abstract

Existing large language model (LLM) serving systems typically employ a Prefill-Decode disaggregated architecture to prevent computational interference between the prefill and decode phases. However, in real-world LLM serving scenarios, request input and output lengths fluctuate significantly. Under traditional static node allocation strategies, this leaves prefill and decode nodes with imbalanced computational loads, so computing resources cannot be used efficiently to improve the system's goodput. To address this challenge, we design and implement Arrow, an adaptive scheduler that leverages stateless instances and the latency characteristics of prefill and decode tasks to achieve efficient adaptive request and instance scheduling. Arrow dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics, substantially…

Taxonomy

Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Software System Performance and Reliability