Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving
Shan Yu, Junyi Shu, Yuanjiang Ni, Kun Qian, Xue Li, Yang Wang, Jinyuan Zhang, Ziyi Xu, Shuo Yang, Lingjun Zhu, Ennan Zhai, Qingda Lu, Jiarong Xing, Youyou Lu, Xin Jin, Xuanzhe Liu, and Harry Xu

TL;DR
Pythia leverages the structured predictability of multi-agent workflows to optimize LLM serving, significantly enhancing throughput and reducing delays compared to traditional systems.
Contribution
This paper introduces Pythia, a novel multi-agent serving system that exploits workflow semantics to improve efficiency in LLM deployment.
Findings
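Identified key bottlenecks: low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling.
Demonstrated Pythia's substantial improvements over existing baselines.
Analyzed production traces from an agent-serving platform and an internal coding assistant to inform system design.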
Abstract
As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty, yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose…
