SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Dongxin Guo, Jikun Wu, Siu Ming Yiu

TL;DR
SAGA introduces a program-level scheduling approach for AI agent workflows on GPU clusters, significantly reducing latency and improving resource utilization by capturing workflow structure and request correlations.
Contribution
The paper presents SAGA, a novel distributed scheduler that treats entire AI agent workflows as first-class scheduling units, outperforming traditional call-level scheduling.
Findings
SAGA reduces task completion time by 1.64x on a 64-GPU cluster.
It improves GPU memory utilization by 1.22x.
It achieves 99.2% SLO attainment under multi-tenant interference.
Abstract
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a…
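For readers unfamiliar with the baseline named in mechanism (1), the sketch below illustrates Bélády's offline policy: on a miss with a full cache, evict the entry whose next use lies farthest in the future. It is not SAGA's algorithm. The function names, the toy trace of session identifiers, and the use of miss counts as the comparison metric are all assumptions made for illustration; the abstract does not say which quantity the 1.31x figure is measured on. LRU is included only as a familiar online policy to compare against.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Count misses under Belady's offline policy (hypothetical helper).

    On a miss with a full cache, evict the key whose next access is
    farthest in the future. Requires the whole trace in advance.
    """
    # next_use[i] = index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of that key's next access
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
        cache[key] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Count misses under LRU, a common online policy (for comparison)."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


# Toy trace: three agent sessions (assumed labels) revisiting cached state.
trace = list("ABCABC")
optimal = belady_misses(trace, capacity=2)
online = lru_misses(trace, capacity=2)
ratio = online / optimal
```

On this toy trace the offline policy incurs 4 misses and LRU incurs 6, so LRU is 1.5x the optimum. A ratio of this kind is one plausible reading of "within 1.31x of Bélády's optimal offline policy", stated here as an assumption.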
