Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective

Noppanat Wadlom; Junyi Shen; Yao Lu

arXiv:2603.16104·cs.MA·March 18, 2026

TL;DR

This paper introduces Helium, an LLM serving framework that models agentic workflows as query plans, enabling reuse across LLM calls and speedups of up to 1.56x over state-of-the-art systems.

Contribution

Helium treats LLM calls as first-class operators within a query plan, which enables cross-call optimization and proactive caching for agentic workflows.
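
To make the "LLM call as plan operator" idea concrete, here is a minimal sketch in Python. It is not Helium's API: the `LLMCall` class, the `execute` function, and the `fake_llm` stand-in are all hypothetical names chosen for illustration. The sketch assumes that two calls with the same prompt and the same upstream operators are interchangeable, so a shared sub-plan needs to run only once.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class LLMCall:
    """Hypothetical plan operator: one LLM invocation fed by other operators."""
    prompt: str
    inputs: Tuple["LLMCall", ...] = ()


def execute(node: LLMCall, llm: Callable[[str], str],
            memo: Dict[LLMCall, str]) -> str:
    """Evaluate a plan bottom-up, running each distinct operator only once."""
    if node in memo:
        return memo[node]  # shared sub-plan: reuse the earlier result
    upstream = [execute(child, llm, memo) for child in node.inputs]
    memo[node] = llm("\n".join([node.prompt, *upstream]))
    return memo[node]


# Stand-in for a real model (assumption): records each prompt it receives.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"out{len(calls)}"


# Two downstream agents consume the same upstream summary.
summary = LLMCall("Summarize the report.")
critic = LLMCall("Critique this summary.", (summary,))
editor = LLMCall("Polish this summary.", (summary,))
final = LLMCall("Merge the feedback.", (critic, editor))

memo: Dict[LLMCall, str] = {}
result = execute(final, fake_llm, memo)
```

The plan references `summary` from two branches, yet the memo table ensures it is sent to the model once, so the workflow issues four calls instead of five. A serving system that sees only individual requests has no plan structure from which to detect this sharing.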

Findings

01. Helium achieves up to 1.56x speedup over state-of-the-art systems.

02. Proactive caching and cache-aware scheduling significantly improve efficiency.

03. End-to-end workflow optimization is crucial for scalable LLM-based agents.

Abstract

Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles…
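
The cache-aware scheduling mentioned in the abstract can be illustrated with a toy model. The sketch below is not Helium's scheduler; it assumes a deliberately simple cache that keeps only the previous request's KV state, and all function names are hypothetical. It shows why the order of execution matters: running prompts with a common prefix back to back lets more tokens be served from cache.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reused_tokens(order: List[List[str]]) -> int:
    """Tokens served from cache, assuming only the previous request's
    KV state is retained (a simplifying assumption for this sketch)."""
    return sum(shared_prefix_len(prev, cur)
               for prev, cur in zip(order, order[1:]))


def cache_aware_order(requests: List[List[str]]) -> List[List[str]]:
    """Sort lexicographically so prompts sharing a prefix become adjacent."""
    return sorted(requests)


# Four requests from two agents (A and B), interleaved on arrival.
arrival = [
    "sys A plan step1".split(),
    "sys B plan step1".split(),
    "sys A plan step2".split(),
    "sys B plan step2".split(),
]
scheduled = cache_aware_order(arrival)
```

In arrival order, each request shares only the `sys` token with its predecessor, for 3 reused tokens in total. After reordering, the two requests of each agent run consecutively and 7 tokens are reused. A real system would manage a larger cache and respect dependencies between calls, but the ordering effect is the same in kind.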

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Big Data and Digital Economy