Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
Yu Wu, Tongxuan Liu, Yuting Zeng, Siyu Wu, Jun Xiong, Xianzhe Dong, Hailong Yang, Ke Zhang, Jing Li

TL;DR
Arrow is an adaptive scheduling system for disaggregated LLM inference that dynamically balances prefill and decode workloads, raising request serving rates by up to 2.55x under variable traffic conditions.
Contribution
It introduces an adaptive scheduler that leverages stateless instances and the latency characteristics of prefill and decode tasks to optimize resource utilization in disaggregated LLM serving architectures.
Findings
Achieves up to 2.55x higher request serving rates
Effectively handles traffic spikes and load variations
Improves resource utilization in LLM inference systems
Abstract
Existing large language model (LLM) serving systems typically employ a Prefill-Decode disaggregated architecture to prevent computational interference between the prefill and decode phases. However, in real-world LLM serving scenarios, significant fluctuations in request input/output lengths lead to imbalanced computational loads between prefill and decode nodes under traditional static node allocation strategies, which prevents computing resources from being used efficiently to improve the system's goodput. To address this challenge, we design and implement Arrow, an adaptive scheduler that leverages stateless instances and the latency characteristics of prefill and decode tasks to achieve efficient adaptive request and instance scheduling. Arrow dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics, substantially…
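The idea of adjusting the number of prefill and decode instances from real-time metrics can be illustrated with a minimal sketch. This is not Arrow's actual algorithm, which the abstract does not spell out. The `Instance` class, the `rebalance` and `count` functions, the two latency metrics (`ttft` for prefill latency, `tpot` for per-token decode latency), and their targets are all hypothetical names chosen for illustration. The sketch assumes instances are stateless, so switching a role is just a label change.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    role: str  # "prefill" or "decode"; stateless, so the role can be flipped freely


def count(instances, role):
    """Number of instances currently assigned to the given role."""
    return sum(1 for i in instances if i.role == role)


def rebalance(instances, ttft, tpot, ttft_slo, tpot_slo):
    """Flip at most one instance toward the phase that misses its latency target.

    ttft / tpot are hypothetical observed latencies for prefill and decode;
    ttft_slo / tpot_slo are hypothetical targets. Each role keeps at least one instance.
    """
    prefill = [i for i in instances if i.role == "prefill"]
    decode = [i for i in instances if i.role == "decode"]
    if ttft > ttft_slo and tpot <= tpot_slo and len(decode) > 1:
        decode[-1].role = "prefill"   # prefill is the bottleneck: borrow a decode instance
    elif tpot > tpot_slo and ttft <= ttft_slo and len(prefill) > 1:
        prefill[-1].role = "decode"   # decode is the bottleneck: borrow a prefill instance
    return instances


pool = [
    Instance("i0", "prefill"),
    Instance("i1", "prefill"),
    Instance("i2", "decode"),
    Instance("i3", "decode"),
]
# A burst of long prompts pushes prefill latency over its target.
rebalance(pool, ttft=2.5, tpot=0.04, ttft_slo=1.0, tpot_slo=0.1)
print(count(pool, "prefill"), count(pool, "decode"))
```

In this toy version one instance moves per call, which keeps the allocation from oscillating on a single noisy measurement; a real scheduler would likely also weigh queue depth and in-flight requests.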
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Software System Performance and Reliability
