TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving

Bingyang Wu; Zili Zhang; Yinmin Zhong; Guanzhe Huang; Yibo Zhu; Xuanzhe Liu; Xin Jin

arXiv:2508.17219 · cs.DC · August 26, 2025

TL;DR

TokenLake introduces a unified segment-level prefix cache pool that improves cache efficiency and load balancing and reduces communication overhead in elastic long-context LLM serving, significantly improving throughput and hit rate.

Contribution

It proposes a novel unified segment-level prefix cache pool with a declarative cache interface and a load-balancing algorithm, so that the scheduler can make elastic request-scheduling decisions without managing the cache itself.

Findings

01 Up to 2.6× throughput improvement
02 Up to 2.1× hit rate increase
03 Effective cache load balancing and deduplication

Abstract

Prefix caching is crucial for accelerating multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling in order to optimize cache efficiency and computation performance together, which leads to load imbalance, data redundancy, and memory fragmentation in the caches across instances. Memory pooling is a promising remedy: it shields the scheduler from the underlying cache management so that the scheduler can focus on optimizing computation. However, because existing prefix caching systems only transfer increasingly longer prefix caches between instances, they cannot achieve low-latency memory pooling. To address these problems, we propose TokenLake, a unified segment-level prefix cache pool. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware…
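To make the idea of a segment-level pool concrete, here is a minimal Python sketch. It is not TokenLake's implementation: the segment size, the chained SHA-256 keys, the `SegmentPool` class, and the least-loaded placement rule are all assumptions chosen for illustration, and the sketch tracks only where a segment would live, not the KV tensors themselves. It shows how fixed-size segments keyed by their full prefix can be stored once in a shared pool (deduplication) and spread across instances, instead of each instance holding whole, increasingly longer prefixes.

```python
import hashlib
from typing import Dict, List, Tuple

SEGMENT_TOKENS = 4  # hypothetical segment size, chosen small for the example


def segment_keys(tokens: List[int], seg: int = SEGMENT_TOKENS) -> List[str]:
    """Return one key per full segment; each key also covers the whole prefix before it."""
    keys: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % seg  # ignore a trailing partial segment
    for start in range(0, full, seg):
        chunk = ",".join(str(t) for t in tokens[start:start + seg])
        parent = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()
        keys.append(parent)
    return keys


class SegmentPool:
    """Toy pool that maps segment keys to instance ids (assumed design, metadata only)."""

    def __init__(self, num_instances: int) -> None:
        self.location: Dict[str, int] = {}
        self.load: List[int] = [0] * num_instances

    def insert(self, tokens: List[int]) -> None:
        for key in segment_keys(tokens):
            if key in self.location:
                continue  # segment already pooled: store it only once
            # Stand-in placement policy: put the segment on the least-loaded instance.
            target = min(range(len(self.load)), key=self.load.__getitem__)
            self.location[key] = target
            self.load[target] += 1

    def match(self, tokens: List[int]) -> List[Tuple[str, int]]:
        """Return (key, instance) for the longest run of cached leading segments."""
        hits: List[Tuple[str, int]] = []
        for key in segment_keys(tokens):
            if key not in self.location:
                break
            hits.append((key, self.location[key]))
        return hits
```

In this sketch, two requests that share their first eight tokens share their first two segment keys, so the pool stores those segments once and a later lookup can find them on whichever instances hold them. A real system would additionally have to move or remotely access the cached tensors, which is the part the abstract's latency concern is about.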
