HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
You Peng, Youhe Jiang, Wenshuang Li, Xu Xu, Ke Zhou, Jiawei Jiang, Chen Wang, Binhang Yuan

TL;DR
HexAGenT is a workflow-aware scheduler for heterogeneous, prefill-decode disaggregated LLM serving clusters. It targets the end-to-end latency of multi-step agentic workflows rather than the latency of individual LLM calls.
Contribution
The paper introduces HexAGenT, a scheduler that models workflows as DAGs and optimizes call placement and prioritization across heterogeneous GPU clusters.
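To make the DAG framing concrete, here is a minimal sketch of a workflow graph whose calls are revealed incrementally at runtime. The `WorkflowDAG` class, its methods, and the call names are hypothetical illustrations and are not taken from the paper; the sketch only shows how a scheduler could track which calls are ready to dispatch.

```python
from collections import defaultdict


class WorkflowDAG:
    """Hypothetical workflow graph; nodes are LLM calls revealed at runtime."""

    def __init__(self):
        self.nodes = set()
        self.parents = defaultdict(set)  # call_id -> calls it depends on
        self.done = set()

    def add_call(self, call_id, depends_on=()):
        # Calls may be added at any time, since the workflow unfolds online.
        self.nodes.add(call_id)
        self.parents[call_id].update(depends_on)

    def complete(self, call_id):
        self.done.add(call_id)

    def ready_calls(self):
        # A call is ready when it is unfinished and all of its parents are done.
        return sorted(
            c for c in self.nodes
            if c not in self.done and self.parents[c] <= self.done
        )


dag = WorkflowDAG()
dag.add_call("plan")
dag.add_call("search_a", depends_on=["plan"])
dag.add_call("search_b", depends_on=["plan"])
first = dag.ready_calls()

dag.complete("plan")
second = dag.ready_calls()

# A downstream call revealed only after planning has finished.
dag.add_call("synthesize", depends_on=["search_a", "search_b"])
third = dag.ready_calls()
```

In this toy run, only `plan` is ready at first; both search calls become ready once it completes, and `synthesize` stays blocked until both searches finish. How HexAGenT actually prioritizes and places ready calls is described in the paper itself.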
Findings
HexAGenT reduces the SLO scale by up to 80.5% on representative workloads.
The average reduction in SLO scale is 33.0% at 99% SLO attainment.
The scheduler effectively manages heterogeneous GPU resources for complex workflows.
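The metrics in these findings can be illustrated with a short sketch. It assumes a definition common in LLM serving evaluations, which this summary does not spell out: a workflow meets its SLO when its end-to-end latency is at most `scale * baseline`, attainment is the fraction of workflows that do so, and the reported SLO scale is the smallest scale reaching the target attainment. The function names and toy latencies below are hypothetical.

```python
import math


def slo_attainment(latencies, baseline, scale):
    """Fraction of workflows whose latency is within scale * baseline."""
    met = sum(1 for lat in latencies if lat <= scale * baseline)
    return met / len(latencies)


def min_slo_scale(latencies, baseline, target=0.99):
    """Smallest SLO scale at which at least `target` of workflows meet the SLO."""
    ranked = sorted(latencies)
    # Number of workflows that must meet the SLO; the small epsilon guards
    # against floating-point overshoot in target * len(ranked).
    k = max(1, math.ceil(target * len(ranked) - 1e-9))
    return ranked[k - 1] / baseline


def relative_reduction(old_scale, new_scale):
    """Relative reduction in SLO scale, e.g. 0.25 for a 25% reduction."""
    return (old_scale - new_scale) / old_scale


# Toy latencies in seconds; these are NOT results from the paper.
toy_latencies = [1.0, 2.0, 3.0, 4.0]
scale_needed = min_slo_scale(toy_latencies, baseline=2.0, target=0.75)
attained = slo_attainment(toy_latencies, baseline=2.0, scale=scale_needed)
reduction = relative_reduction(old_scale=2.0, new_scale=scale_needed)
```

Under this assumed definition, a lower SLO scale at the same attainment means the scheduler can satisfy tighter latency targets for the same share of workflows.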
Abstract
Agentic LLM applications increasingly execute user requests as multi-step workflows involving planning, tool use, branching, refinement, and synthesis. In such settings, users experience the end-to-end latency of an entire workflow, not the latency of any single LLM call. In this paper, we study how to schedule online agentic workflows across heterogeneous prefill-decode disaggregated LLM serving clusters to efficiently meet workflow-level latency objectives. The problem is challenging because workflow dependencies are revealed incrementally at runtime, calls have heterogeneous prompts, outputs, and KV-cache requirements, and the prefill and decode stages impose different compute, memory, and transfer constraints across heterogeneous GPUs. To solve this problem, we present HexAGenT, a workflow-aware scheduler for a heterogeneous prefill-decode inference service. HexAGenT models each…
