AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU
Yuning Zhang, Yan Yan, Nan Yang, Dong Yuan

TL;DR
AgentServe is a GPU serving system for agentic AI workloads. It manages resource contention between long prefills and short decodes on a single consumer-grade GPU to keep latency low and multi-agent performance stable.
Contribution
It introduces a co-designed algorithm-system approach that isolates and dynamically manages GPU resources for different workload phases, improving stability and efficiency.
Findings
Up to 2.8x improvement in time to first token (TTFT) over baselines
Up to 2.7x improvement in time per output token (TPOT) over baselines
More stable latency across the evaluated settings
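The two metrics in the findings can be illustrated with a minimal sketch. It assumes the commonly used definitions (TTFT as the delay from request arrival to the first emitted token, TPOT as the mean gap between consecutive output tokens after the first); this summary does not state the paper's exact measurement procedure, and the function names `ttft`, `tpot`, and `speedup` are hypothetical helpers introduced only for illustration.

```python
from typing import Sequence


def ttft(request_arrival: float, token_times: Sequence[float]) -> float:
    """Time to first token: delay from request arrival to the first emitted token.

    Assumes all timestamps are in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("no tokens were emitted")
    return token_times[0] - request_arrival


def tpot(token_times: Sequence[float]) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def speedup(baseline: float, improved: float) -> float:
    """Ratio by which a latency metric improved (values above 1 mean faster)."""
    if improved <= 0:
        raise ValueError("improved latency must be positive")
    return baseline / improved


if __name__ == "__main__":
    # Illustrative timestamps only, not measurements from the paper.
    arrival = 0.25
    tokens = [1.0, 1.5, 2.0]
    print("TTFT (s):", ttft(arrival, tokens))
    print("TPOT (s):", tpot(tokens))
```

Under these definitions, an "up to 2.8x" TTFT improvement would correspond to `speedup(baseline_ttft, agentserve_ttft)` reaching 2.8 in the best case, where both arguments are hypothetical measured values.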
Abstract
Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under multiple request arrivals from different AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts, resume prefills, which append…
Taxonomy
Topics: Big Data and Digital Economy · Multimodal Machine Learning Applications · Parallel Computing and Optimization Techniques
