Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines
Marcel Wagenländer, Otto White, Britannio Jarrett, Pedro Silvestre, Yanda Tao, Guo Li, Huanzhou Zhu, Lluís Vilanova, Peter Pietzuch

TL;DR
Scepsy is a system that efficiently schedules complex, multi-LLM workflows on GPU clusters by leveraging stable execution share profiles to optimize latency and throughput.
Contribution
It introduces a novel approach using aggregate LLM execution shares and a lightweight predictor to optimize GPU allocations for agentic workflows.
Findings
Achieves up to 2.4x higher throughput
Reduces latency by up to 27x
Outperforms systems optimizing LLMs independently
Abstract
Agentic workflows carry out complex tasks by orchestrating multiple large language models (LLMs) and tools. Serving such workflows at a target throughput with low latency is challenging because they can be defined using arbitrary agentic frameworks and exhibit unpredictable execution times: execution may branch, fan out, or recur in data-dependent ways. Since LLMs in workflows often outnumber available GPUs, their execution also leads to GPU oversubscription. We describe Scepsy, a new agentic serving system that efficiently schedules arbitrary multi-LLM agentic workflows onto a GPU cluster. Scepsy exploits the insight that, while agentic workflows have unpredictable end-to-end latencies, the shares of each LLM's total execution times are comparatively stable across executions. Scepsy decides on GPU allocations based on these aggregate shares: first, it profiles the LLMs under…
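To make the "aggregate shares" idea concrete, here is a minimal Python sketch. It is not Scepsy's algorithm: the abstract is truncated before it describes the profiling and allocation steps, so the function names, the trace format (a dict of per-LLM execution seconds for one workflow run), and the proportional largest-remainder split are all assumptions chosen for illustration.

```python
from collections import defaultdict


def execution_shares(trace):
    """Fraction of one workflow run's total LLM time spent in each LLM."""
    total = sum(trace.values())
    return {llm: t / total for llm, t in trace.items()}


def aggregate_shares(traces):
    """Mean per-LLM share across runs; an LLM absent from a run counts as 0."""
    sums = defaultdict(float)
    for trace in traces:
        for llm, share in execution_shares(trace).items():
            sums[llm] += share
    return {llm: s / len(traces) for llm, s in sums.items()}


def proportional_allocation(shares, num_gpus):
    """Hypothetical policy: split num_gpus by share, largest-remainder rounding."""
    exact = {llm: s * num_gpus for llm, s in shares.items()}
    alloc = {llm: int(x) for llm, x in exact.items()}
    leftover = num_gpus - sum(alloc.values())
    # Hand the remaining GPUs to the LLMs with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda llm: exact[llm] - alloc[llm], reverse=True)
    for llm in by_remainder[:leftover]:
        alloc[llm] += 1
    return alloc


# Two runs whose end-to-end times differ by 5x but whose per-LLM shares match.
# The LLM names and timings are made up for this example.
traces = [
    {"planner": 2.0, "coder": 6.0, "critic": 2.0},
    {"planner": 10.0, "coder": 30.0, "critic": 10.0},
]
shares = aggregate_shares(traces)
allocation = proportional_allocation(shares, num_gpus=5)
```

The two example runs take 10 s and 50 s of LLM time in total, yet each LLM's share is the same in both, which is the kind of stability the abstract refers to. This sketch assigns whole GPUs only, so it does not address the oversubscribed case the abstract mentions, where LLMs outnumber GPUs.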
