MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
Yifei Wang, Hancheng Ye, Yechen Xu, Cong Guo, Chiyue Wei, Qinsi Wang, Dongting Li, Tingjun Chen, Hai "Helen" Li, Danyang Zhuo, Yiran Chen

TL;DR
MARS is a co-scheduling system that manages heterogeneous GPU-CPU resources for agentic LLM workloads, reducing end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput.
Contribution
MARS introduces a holistic, adaptive co-scheduling approach that combines unified visibility across GPU inference and CPU tool execution with an agent-centric scheduler to optimize resource utilization for agentic workloads.
Findings
MARS reduces end-to-end latency by up to 5.94x.
It maintains nearly maximal system throughput.
It accelerates task completion time in real-world deployments by up to 1.87x.
Abstract
Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal…
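The abstract's idea of decoupling admission from execution to avoid oversubscribing GPU and CPU resources can be illustrated with a minimal sketch. This is not the MARS implementation: the class names, the abstract "units" for GPU and CPU demand, and the FIFO policy are all assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AgentRequest:
    agent_id: str
    gpu_units: int  # hypothetical GPU demand (e.g., inference memory blocks)
    cpu_units: int  # hypothetical CPU demand (e.g., cores for tool execution)


class AdmissionController:
    """Hypothetical control plane: admits an agent only if BOTH budgets fit."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu_free = gpu_capacity
        self.cpu_free = cpu_capacity
        self.waiting = deque()  # requests not yet admitted
        self.running = {}       # agent_id -> admitted request

    def submit(self, req: AgentRequest) -> list:
        """Queue a request and return the ids admitted as a result."""
        self.waiting.append(req)
        return self._admit()

    def finish(self, agent_id: str) -> list:
        """Release an agent's resources and admit any waiting agents that fit."""
        req = self.running.pop(agent_id)
        self.gpu_free += req.gpu_units
        self.cpu_free += req.cpu_units
        return self._admit()

    def _admit(self) -> list:
        admitted = []
        while self.waiting:
            head = self.waiting[0]
            # Assumed FIFO policy: a head that does not fit blocks the queue.
            if head.gpu_units > self.gpu_free or head.cpu_units > self.cpu_free:
                break
            self.waiting.popleft()
            self.gpu_free -= head.gpu_units
            self.cpu_free -= head.cpu_units
            self.running[head.agent_id] = head
            admitted.append(head.agent_id)
        return admitted


ctl = AdmissionController(gpu_capacity=4, cpu_capacity=4)
first = ctl.submit(AgentRequest("a", gpu_units=3, cpu_units=1))
second = ctl.submit(AgentRequest("b", gpu_units=2, cpu_units=1))
```

In this sketch, agent "b" stays in the waiting queue until "a" finishes, because admitting it would exceed the GPU budget even though CPU capacity is available. A real scheduler would likely use richer signals than static unit counts.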
