Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

Haoyu Zheng; Fangcheng Fu; Jia Wu; Binhang Yuan; Yongqiang Zhang; Hao Wang; Yuanyuan Zhu; Xiao Yan; Jiawei Jiang

arXiv:2605.06472·cs.LG·May 8, 2026

TL;DR

PBKV is a serving system that predicts the sequence of agent invocations in dynamic LLM workflows and uses those predictions to guide KV-Cache reuse, reporting up to 1.85x speedup over LRU on dynamic workflows.

Contribution

The paper introduces a prediction-based KV-Cache management system that adapts to dynamic workflows and outperforms both agent-level cache management and workflow-level approaches that assume a static agent sequence.

Findings

01

Up to 1.85x speedup over LRU on dynamic workflows.

02

Up to 1.26x speedup over KVFlow on static workflows.

03

Effective cache reuse prediction improves workflow serving efficiency.

Abstract

LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at the agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic: the sequence of invoked agents, and thus the induced cache reuse opportunities, depend on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (Prediction-Based KV-Cache Management). For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and the context of the target workflow. Based on the predictions, PBKV…
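
The prediction step described in the abstract can be illustrated with a small sketch. This is not PBKV's actual algorithm, which the truncated abstract does not specify. It assumes a first-order transition model over historical traces, a per-task context score for each agent, a linear blend controlled by a hypothetical weight alpha, and a single-step eviction rule. All function names, agent names, and parameters below are invented for illustration.

```python
from collections import Counter, defaultdict


def build_transitions(histories):
    """Count agent -> next-agent transitions in past workflow traces."""
    counts = defaultdict(Counter)
    for trace in histories:
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1
    return counts


def predict_next(counts, current_agent, context_scores, alpha=0.5):
    """Blend historical transition frequencies with context scores.

    alpha is a hypothetical mixing weight (an assumption, not from the paper).
    """
    hist = counts.get(current_agent, Counter())
    total = sum(hist.values())
    agents = set(hist) | set(context_scores)
    return {
        a: alpha * (hist[a] / total if total else 0.0)
        + (1 - alpha) * context_scores.get(a, 0.0)
        for a in agents
    }


def choose_victim(cached_agents, scores, last_used):
    """Evict the cached agent least likely to run next; break ties by LRU."""
    return min(cached_agents, key=lambda a: (scores.get(a, 0.0), last_used[a]))


# Hypothetical historical traces and a context signal for the current task.
histories = [
    ["planner", "coder", "tester"],
    ["planner", "coder", "reviewer"],
    ["planner", "searcher", "coder"],
]
counts = build_transitions(histories)
scores = predict_next(counts, "planner", {"searcher": 1.0}, alpha=0.5)

# Logical timestamps of the last use of each cached agent's KV-Cache.
last_used = {"coder": 3, "searcher": 1, "tester": 2}
victim = choose_victim(["coder", "searcher", "tester"], scores, last_used)
print(scores, victim)
```

In this toy setup a plain LRU policy would evict the searcher entry because it was used least recently, while the prediction-guided rule evicts the tester entry, which has no predicted use in the next step. That difference is the kind of reuse opportunity the abstract attributes to prediction-based management.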
