AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU

Yuning Zhang; Yan Yan; Nan Yang; Dong Yuan

arXiv:2603.10342 · cs.DC · March 12, 2026

Open Access

TL;DR

AgentServe is a serving system for agentic AI workloads on a single consumer-grade GPU. It manages resource contention between long prefills and short decodes to keep latency low and multi-agent performance stable.

Contribution

AgentServe introduces an algorithm-system co-design that isolates GPU resources for the different workload phases and manages them dynamically, improving stability and efficiency.

Findings

01 Up to 2.8x improvement in time to first token (TTFT) over baselines

02 Up to 2.7x improvement in time per output token (TPOT) over baselines

03 Significantly more stable latency across settings
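The two metrics in the findings can be made concrete with a small sketch. The excerpt here does not spell out how the paper measures them, so the code below assumes the conventional definitions: TTFT as the delay from request arrival to the first output token, and TPOT as the mean gap between consecutive output tokens after the first. The `RequestTrace` type and all function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (not from the paper)."""
    arrival_s: float            # time the request reached the server
    token_times_s: List[float]  # emission time of each output token, ascending


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first emission time minus arrival time."""
    if not trace.token_times_s:
        raise ValueError("trace has no output tokens")
    return trace.token_times_s[0] - trace.arrival_s


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times_s)
    if n < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


def improvement(baseline: float, candidate: float) -> float:
    """Ratio by which a latency metric improves (values above 1 are better)."""
    return baseline / candidate
```

Under these assumptions, an "Nx improvement" means the baseline's latency divided by the new system's latency equals N, so a larger ratio corresponds to lower latency.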

Abstract

Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under multiple request arrivals from different AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts, resume prefills, which append…

Taxonomy

Topics: Big Data and Digital Economy · Multimodal Machine Learning Applications · Parallel Computing and Optimization Techniques