HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

You Peng; Youhe Jiang; Wenshuang Li; Xu Xu; Ke Zhou; Jiawei Jiang; Chen Wang; Binhang Yuan

arXiv:2605.16637·cs.DC·May 19, 2026

TL;DR

HexAGenT is a workflow-aware scheduler for heterogeneous prefill-decode disaggregated LLM serving clusters. It targets the end-to-end latency of multi-step agentic workflows rather than the latency of individual LLM calls.

Contribution

The paper introduces HexAGenT, a new scheduling algorithm that models workflows as DAGs and optimizes placement and prioritization across diverse GPU clusters.
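To illustrate why modeling a workflow as a DAG matters for latency, here is a minimal sketch that computes a workflow's end-to-end latency as the longest dependency path through its calls. It is not the paper's algorithm: the function name `workflow_latency`, the example workflow, and the per-call latencies are all hypothetical, and queueing, placement, and KV-cache transfer costs are ignored.

```python
from graphlib import TopologicalSorter


def workflow_latency(deps, call_latency):
    """Return the end-to-end latency of a workflow DAG.

    deps maps each call to the calls it depends on (hypothetical format).
    call_latency maps each call to its own latency in seconds.
    Assumes every call starts as soon as all of its predecessors finish.
    """
    finish = {}
    # static_order() yields each call only after all of its predecessors.
    for call in TopologicalSorter(deps).static_order():
        start = max((finish[p] for p in deps.get(call, ())), default=0.0)
        finish[call] = start + call_latency[call]
    return max(finish.values(), default=0.0)


# Hypothetical agentic workflow: plan, two parallel branches, then synthesis.
deps = {
    "plan": [],
    "search": ["plan"],
    "code": ["plan"],
    "synthesize": ["search", "code"],
}
latency = {"plan": 1.0, "search": 2.5, "code": 0.75, "synthesize": 1.5}

print(workflow_latency(deps, latency))
```

In this toy example the user waits for the path plan, search, synthesize, so speeding up the `code` branch would not shorten the workflow. That is the kind of dependency information a workflow-aware scheduler can use when prioritizing calls.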

Findings

01

HexAGenT reduces SLO scale by up to 80.5% in representative workloads.

02

It achieves an average 33.0% reduction in SLO scale at 99% attainment.

03

The scheduler effectively manages heterogeneous GPU resources for complex workflows.
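The findings above are stated in terms of SLO scale and attainment, which this page does not define. The sketch below shows one common convention, assumed here rather than taken from the paper: a workflow meets its SLO if its latency is at most `slo_scale` times a per-workflow baseline latency, and attainment is the fraction of workflows that do. The function names and sample numbers are hypothetical.

```python
import math


def slo_attainment(latencies, baselines, slo_scale):
    """Fraction of workflows whose latency is within slo_scale * baseline.

    Assumed definition; the paper may define the SLO differently.
    """
    if not latencies:
        raise ValueError("need at least one workflow")
    met = sum(
        1 for lat, base in zip(latencies, baselines) if lat <= slo_scale * base
    )
    return met / len(latencies)


def min_slo_scale(latencies, baselines, target=0.99):
    """Smallest SLO scale at which attainment reaches the target fraction."""
    if not latencies:
        raise ValueError("need at least one workflow")
    ratios = sorted(lat / base for lat, base in zip(latencies, baselines))
    needed = math.ceil(target * len(ratios))  # workflows that must meet the SLO
    return ratios[needed - 1]


# Hypothetical measurements: four workflows sharing a 2-second baseline.
observed = [2.0, 4.0, 6.0, 8.0]
baseline = [2.0, 2.0, 2.0, 2.0]

print(min_slo_scale(observed, baseline, target=0.75))
print(slo_attainment(observed, baseline, slo_scale=3.0))
```

Under this reading, a scheduler that lowers the scale returned by `min_slo_scale` at a 0.99 target can meet tighter latency objectives for the same workload.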

Abstract

Agentic LLM applications increasingly execute user requests as multi-step workflows involving planning, tool use, branching, refinement, and synthesis. In such settings, users experience the end-to-end latency of an entire workflow, not the latency of any single LLM call. In this paper, we study how to schedule online agentic workflows across heterogeneous prefill-decode disaggregated LLM serving clusters to efficiently meet workflow-level latency objectives. The problem is challenging because workflow dependencies are revealed incrementally at runtime, calls have heterogeneous prompts, outputs, and KV-cache requirements, and the prefill and decode stages impose different compute, memory, and transfer constraints across heterogeneous GPUs. To solve this problem, we present HexAGenT, a workflow-aware scheduler for a heterogeneous prefill-decode inference service. HexAGenT models each…
