Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
Ritik Raj, Souvik Kundu, Ishita Vohra, Hong Wang, Tushar Krishna

TL;DR
This paper analyzes CPU-centric bottlenecks in agentic AI workloads and proposes scheduling optimizations to improve latency and resource utilization on heterogeneous systems.
Contribution
It provides a detailed characterization of agentic AI execution bottlenecks and introduces two scheduling methods, COMB and MAS, to optimize performance.
Findings
COMB reduces P50 latency by up to 1.7x in homogeneous workloads.
COMB achieves up to 3.9x/1.8x lower service/total latency under load.
MAS reduces total latency for minority request types by up to 2.37x/2.49x at P50/P90.
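The "Nx lower latency at P50/P90" figures above are ratios of latency percentiles. A minimal sketch of that arithmetic follows, assuming a nearest-rank percentile definition; the summary does not say which percentile method the paper uses. The function names and the latency samples are hypothetical, made up for illustration, and are not data from the paper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_reduction(baseline, optimized, p):
    """Ratio of baseline to optimized latency at percentile p (2.0 means '2x lower')."""
    return percentile(baseline, p) / percentile(optimized, p)


# Hypothetical per-request latencies in milliseconds (illustrative only).
baseline_ms = [100, 120, 140, 160, 180, 200, 220, 240, 260, 400]
optimized_ms = [50, 60, 70, 80, 90, 100, 110, 120, 130, 200]

p50_gain = latency_reduction(baseline_ms, optimized_ms, 50)
p90_gain = latency_reduction(baseline_ms, optimized_ms, 90)
print(f"P50 reduction: {p50_gain:.2f}x, P90 reduction: {p90_gain:.2f}x")
```

Reporting both P50 and P90 separates the typical request from the tail, which matters when a scheduler helps minority request types that would otherwise queue behind the majority.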
Abstract
Agentic AI serving converts monolithic LLM-based inference to autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, with the majority of the external tools responsible for agentic capability either running on or being orchestrated by the CPU. To build a deeper understanding of the CPU's role, this paper characterizes and analyzes the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads, analyzing the end-to-end latency and throughput on two different hardware systems to isolate respective…
