GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
Boxiao Du, Boning Huangfu, Yizhou Luo, Chen Chen, Zijun Li, Minchen Yu, Xiaoyi Fan, Minyi Guo

TL;DR
GoodServe is a serving system that routes and manages agentic LLM inference requests over heterogeneous GPU resources so that more requests meet their end-to-end latency requirements, improving goodput.
Contribution
It introduces a predict-and-rectify inference routing approach: it estimates request output lengths and GPU serving status to make routing decisions up front, then corrects those decisions at runtime through dynamic request migration.
Findings
GoodServe improves goodput by up to 27.4% over existing methods.
It accurately estimates request output lengths and GPU status for better routing.
The system effectively manages runtime request migrations to meet latency SLOs.
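As a rough illustration of the metric behind these findings, the sketch below counts only requests that finish within their latency SLO. The function names, the per-second normalization, and the sample numbers are assumptions made for illustration; the summary does not state how the paper normalizes goodput.

```python
from typing import Sequence


def goodput(latencies_s: Sequence[float], slo_s: float, window_s: float) -> float:
    """Requests per second that completed within the SLO.

    Assumption: goodput is normalized by the length of the measurement window.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / window_s


def slo_attainment(latencies_s: Sequence[float], slo_s: float) -> float:
    """Fraction of completed requests that met the SLO."""
    if not latencies_s:
        return 0.0
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / len(latencies_s)


if __name__ == "__main__":
    # Hypothetical end-to-end latencies (seconds) observed in a 10 s window.
    sample = [1.0, 2.5, 4.0, 0.5]
    print(goodput(sample, slo_s=3.0, window_s=10.0))
    print(slo_attainment(sample, slo_s=3.0))
```

Unlike raw throughput, a request that finishes after its deadline contributes nothing here, which is why routing quality matters as much as aggregate speed.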
Abstract
Large Language Models (LLMs) play a critical role in emerging agentic applications, where the timely completion of each entire inference is essential. Meanwhile, agentic LLM inferences are increasingly served on heterogeneous GPUs in operators' resource pools. Therefore, it is crucial to route incoming inference requests to appropriate GPUs so that their end-to-end latency requirements are satisfied whenever possible, thereby achieving high goodput. In this paper, we propose GoodServe, a goodput-optimized serving system for agentic inferences over heterogeneous resources. GoodServe performs inference routing in a predict-and-rectify manner. It estimates request output lengths and GPU serving status in a way that is both accurate and practical. Based on information from both the demand and resource sides, it then makes high-quality routing decisions using a just-enough instance…
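The predict-and-rectify idea can be sketched in a few lines. The abstract is truncated before it describes the routing policy, so everything below is an assumption made for illustration and not GoodServe's implementation: the `Instance` fields, the linear latency model, the rule of choosing the slowest instance that still fits the SLO, and the migration check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Instance:
    name: str
    tokens_per_sec: float  # assumed steady decode speed of this GPU instance
    queue_delay_s: float   # assumed current queueing delay on this instance


def predicted_latency(inst: Instance, est_output_tokens: int) -> float:
    # Assumed linear model: wait in the queue, then decode at a fixed rate.
    return inst.queue_delay_s + est_output_tokens / inst.tokens_per_sec


def route(instances: Sequence[Instance], est_output_tokens: int, slo_s: float) -> Instance:
    """Predict step: pick an instance using the estimated output length."""
    feasible = [i for i in instances
                if predicted_latency(i, est_output_tokens) <= slo_s]
    if not feasible:
        # Nothing is predicted to meet the SLO: fall back to the fastest finish.
        return min(instances, key=lambda i: predicted_latency(i, est_output_tokens))
    # Assumed policy: use the slowest instance that still meets the SLO,
    # keeping faster instances free for tighter requests.
    return min(feasible, key=lambda i: i.tokens_per_sec)


def should_migrate(inst: Instance, elapsed_s: float,
                   est_remaining_tokens: int, slo_s: float) -> bool:
    """Rectify step: flag a running request whose projected finish misses the SLO."""
    projected = elapsed_s + est_remaining_tokens / inst.tokens_per_sec
    return projected > slo_s


if __name__ == "__main__":
    pool = [Instance("fast", 100.0, 0.5), Instance("slow", 25.0, 0.2)]
    chosen = route(pool, est_output_tokens=200, slo_s=10.0)
    print(chosen.name)
    # The request turns out longer than predicted; re-check mid-flight.
    print(should_migrate(chosen, elapsed_s=6.0, est_remaining_tokens=200, slo_s=10.0))
```

The two functions mirror the two halves of the name: `route` acts on predictions before the request starts, and `should_migrate` revisits that decision once the actual progress diverges from the estimate.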
