TL;DR
Continuum is a serving system that optimizes GPU KV cache management for multi-turn LLM agent workloads, reducing job completion times by retaining the KV cache during tool calls.
Contribution
It introduces a TTL-based KV cache retention mechanism that balances recomputation costs and queueing delays, enhancing efficiency and robustness in multi-turn LLM agent serving.
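To make the retention idea concrete, here is a minimal Python sketch of TTL-based pinning. It is not Continuum's implementation: the class and method names (`TTLKVCache`, `on_tool_call`, `on_next_turn`, `can_admit`) are hypothetical, and the single fixed TTL is an assumption, since this summary does not say how the TTL is chosen. The sketch only shows the trade-off named above: a pinned cache saves recomputation if the tool returns in time, but occupies GPU blocks that waiting requests could otherwise use.

```python
from dataclasses import dataclass, field


@dataclass
class KVEntry:
    num_blocks: int    # GPU blocks held by this job's KV cache
    expires_at: float  # timestamp after which the pin is released


@dataclass
class TTLKVCache:
    capacity_blocks: int
    ttl_seconds: float  # assumed fixed here; a real system may tune it
    entries: dict = field(default_factory=dict)

    def expire(self, now: float) -> list:
        """Release every pin whose TTL has elapsed; return the evicted job ids."""
        expired = [j for j, e in self.entries.items() if e.expires_at <= now]
        for job_id in expired:
            del self.entries[job_id]
        return expired

    def on_tool_call(self, job_id: str, num_blocks: int, now: float) -> None:
        """Pin a job's KV cache when its turn ends in a tool call."""
        self.entries[job_id] = KVEntry(num_blocks, now + self.ttl_seconds)

    def on_next_turn(self, job_id: str, now: float) -> bool:
        """True if the cache survived the tool call (reuse), False if it
        expired and the prefix must be recomputed or reloaded."""
        self.expire(now)
        return self.entries.pop(job_id, None) is not None

    def can_admit(self, num_blocks: int, now: float) -> bool:
        """Pinned blocks count against capacity, which is the source of the
        extra queueing delay for waiting requests."""
        self.expire(now)
        pinned = sum(e.num_blocks for e in self.entries.values())
        return pinned + num_blocks <= self.capacity_blocks


cache = TTLKVCache(capacity_blocks=100, ttl_seconds=5.0)
cache.on_tool_call("job-a", num_blocks=60, now=0.0)
blocked = cache.can_admit(50, now=1.0)      # pin still holds 60 blocks
hit = cache.on_next_turn("job-a", now=3.0)  # tool returned within the TTL
```

A short TTL frees memory sooner but risks recomputation when a tool runs long; a long TTL does the opposite. Timestamps are passed in explicitly so the behaviour is easy to trace by hand.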
Findings
Over 8x reduction in average job completion times.
Improved throughput in real-world agent workloads.
Effective cache retention during tool calls enhances multi-turn continuity.
Abstract
KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict a finished request's KV cache if new requests are waiting. This policy breaks down for agentic workloads, which interleave LLM calls with tool calls, introducing pauses that prevent effective KV reuse across turns. Since many tool calls are much shorter than human response times in multi-turn chatbots, it is promising to retain the KV cache during these tool calls. However, many challenges remain. First, we need to consider both the potential cost of recomputation or reloading (if offloading is enabled) and the increased queueing delays after eviction from the GPU. Second, because tool call durations vary widely, the method needs to remain robust when those durations are hard to predict. We present Continuum, a serving system to optimize job…
