Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

Hanchen Li; Runyuan He; Qiuyang Mang; Qizheng Zhang; Huanzhi Mao; Xiaokun Chen; Hangrui Zhou; Alvin Cheung; Joseph Gonzalez; Ion Stoica

arXiv:2511.02230·cs.OS·May 12, 2026

TL;DR

Continuum is a system that optimizes GPU cache management for multi-turn LLM agent workloads, significantly reducing job completion times by intelligently retaining cache during tool calls.

Contribution

It introduces a TTL-based KV cache retention mechanism that balances recomputation costs and queueing delays, enhancing efficiency and robustness in multi-turn LLM agent serving.

Findings

01. Over 8x reduction in average job completion times.

02. Improved throughput in real-world agent workloads.

03. Effective cache retention during tool calls enhances multi-turn continuity.

Abstract

KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM calls with tool calls, introducing pauses that prevent effective KV reuse across turns. Since many tool calls are much shorter than human response times in multi-turn chatbots, it is promising to retain the KV cache during these tool calls. However, several challenges remain. First, we need to consider both the potential cost of recomputation or reloading (if offloading is enabled) and the increased queueing delays after eviction from the GPU. Second, due to the inherent variance of tool call durations, the method must remain robust when those durations are only weakly predictable. We present Continuum, a serving system to optimize job…
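To make the trade-off in the abstract concrete, the following is a minimal sketch of TTL-based retention, not Continuum's actual implementation. The class `TTLKVCache`, the function `pick_ttl`, and the cost parameters `recompute_cost` and `hold_cost_per_sec` are hypothetical names introduced here for illustration. The cost model is an assumption: holding a cache entry costs a fixed amount per second (a stand-in for the queueing delay imposed on other requests), and a tool call that outlives the TTL pays a fixed recomputation cost.

```python
from dataclasses import dataclass, field


@dataclass
class TTLKVCache:
    """Toy model of TTL-based KV retention (hypothetical, not Continuum's API)."""
    ttl_seconds: float
    _expiry: dict = field(default_factory=dict)  # job_id -> absolute expiry time

    def pin(self, job_id: str, now: float) -> None:
        """Called when a job's LLM turn ends and its tool call starts."""
        self._expiry[job_id] = now + self.ttl_seconds

    def evict_expired(self, now: float) -> list:
        """Release entries whose TTL has elapsed; returns the evicted job ids."""
        expired = [j for j, t in self._expiry.items() if t <= now]
        for j in expired:
            del self._expiry[j]
        return expired

    def resume(self, job_id: str, now: float) -> bool:
        """Called when the tool returns. True means the KV cache was still held."""
        self.evict_expired(now)
        return self._expiry.pop(job_id, None) is not None


def pick_ttl(durations, candidates, recompute_cost, hold_cost_per_sec):
    """Pick the candidate TTL with the lowest mean cost over observed durations.

    Assumed cost model: a tool call of duration d that finishes within the TTL
    costs d * hold_cost_per_sec; otherwise it costs the full TTL of holding
    plus one recomputation.
    """
    def expected_cost(ttl):
        total = 0.0
        for d in durations:
            if d <= ttl:
                total += d * hold_cost_per_sec
            else:
                total += ttl * hold_cost_per_sec + recompute_cost
        return total / len(durations)

    return min(candidates, key=expected_cost)


# Mostly short tool calls with one long outlier: a short TTL beats both extremes.
best_ttl = pick_ttl([1, 1, 1, 30], candidates=[0, 2, 60],
                    recompute_cost=5, hold_cost_per_sec=1)

cache = TTLKVCache(ttl_seconds=best_ttl)
cache.pin("job-a", now=0.0)
hit = cache.resume("job-a", now=1.0)    # tool returned within the TTL
cache.pin("job-b", now=0.0)
miss = cache.resume("job-b", now=5.0)   # tool outlived the TTL
```

In this toy setup, a TTL of zero always pays for recomputation and a very long TTL pays to hold the cache through the slow outlier, so a bounded TTL limits the worst case even when individual tool durations are hard to predict.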

Code & Models

Repositories

hanchenli/vllm-continuum (GitHub)