TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

Zhuohang Bian; Feiyang Wu; Zhuoran Li; Teng Ma; Youwei Zhuo

arXiv:2510.18586 · cs.DC · May 21, 2026

TL;DR

TokenCake is a KV-Cache-centric serving framework for LLM-based multi-agent applications. It co-optimizes request scheduling and KV Cache memory management, cutting end-to-end latency by over 47% and raising GPU memory utilization by up to 16.9%.

Contribution

TokenCake introduces agent-aware scheduling and memory management techniques tailored for multi-agent LLM workloads: a Temporal Scheduler that offloads idle KV Caches during function calls, and a Spatial Scheduler that dynamically partitions GPU memory.

Findings

01. Reduces end-to-end latency by over 47%.

02. Improves GPU memory utilization by up to 16.9%.

03. Demonstrates effectiveness on representative multi-agent benchmarks.

Abstract

Large Language Models (LLMs) are increasingly deployed in complex multi-agent applications that rely on external function calls. This workload creates severe performance challenges for the KV Cache: spatial contention leads to the eviction of critical agents' caches and temporal underutilization leaves the cache of agents stalled on long-running function calls idling in GPU memory. We present TokenCake, a KV-Cache-centric serving framework that bridges this gap by co-optimizing scheduling and memory management through an agent-aware design. TokenCake's Temporal Scheduler employs an event-driven, opportunistic policy to proactively offload idle KV Caches during function calls and uses predictive uploading to hide data transfer latency. TokenCake's Spatial Scheduler uses dynamic memory partitioning, guided by a hybrid priority metric combining graph structure and runtime state, to reserve…
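The offload-and-predictive-upload idea described in the abstract can be illustrated with a small sketch. This is not TokenCake's implementation: the `StalledAgent` type, the function names, the fixed link bandwidth, and the "stall must outlast a round trip" rule are all assumptions chosen for illustration. It shows one plausible way to decide whether offloading an idle KV Cache is worthwhile and when to begin uploading it so the transfer is hidden behind the function call.

```python
from dataclasses import dataclass


@dataclass
class StalledAgent:
    """Hypothetical record for an agent paused on an external function call."""
    kv_bytes: int            # size of the agent's KV Cache in GPU memory
    predicted_call_s: float  # predicted duration of the function call, in seconds


def transfer_time_s(kv_bytes: int, link_bytes_per_s: float) -> float:
    """One-way time to move the cache over the host-device link."""
    return kv_bytes / link_bytes_per_s


def should_offload(agent: StalledAgent, link_bytes_per_s: float,
                   memory_pressure: bool) -> bool:
    """Assumed policy: offload only when the GPU needs the space and the
    predicted stall is longer than the offload plus upload round trip."""
    round_trip = 2 * transfer_time_s(agent.kv_bytes, link_bytes_per_s)
    return memory_pressure and agent.predicted_call_s > round_trip


def upload_start_s(agent: StalledAgent, link_bytes_per_s: float) -> float:
    """Seconds after the call begins at which to start uploading, so the
    transfer completes around the time the call is predicted to return."""
    one_way = transfer_time_s(agent.kv_bytes, link_bytes_per_s)
    return max(0.0, agent.predicted_call_s - one_way)
```

For example, with a 2 GB cache and an assumed 20 GB/s link, a one-way transfer takes about 0.1 s, so a call predicted to last 1 s would qualify for offloading under memory pressure, with the upload starting roughly 0.9 s into the call. A call predicted to last 0.15 s would not qualify, because the round trip alone would take longer than the stall.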

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Multimodal Machine Learning Applications