MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Cunchen Hu; Heyang Huang; Junhao Hu; Jiang Xu; Xusheng Chen; Tao Xie; Chenxi Wang; Sa Wang; Yungang Bao; Ninghui Sun; Yizhou Shan

arXiv:2406.17565 · cs.DC · December 24, 2024 · 2 cites

TL;DR

MemServe is an LLM serving system that combines context caching with disaggregated inference through an elastic memory pool (MemPool), improving job completion time and time-to-first-token.

Contribution

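MemServe introduces MemPool, an elastic memory pool that manages distributed memory and KV caches across serving instances, and a global scheduler. Together they allow inter-request and intra-request optimizations to be combined in a single LLM serving system.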

Findings

1. Significant reduction in job completion time.

2. Improved time-to-first-token.

3. Enhanced cache reuse through a global prompt tree-based, locality-aware scheduling policy.

Abstract

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.
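
The global prompt tree-based, locality-aware policy mentioned in the abstract can be illustrated with a small sketch. The abstract gives no implementation detail, so everything below is an assumption made for illustration: the class names `TrieNode` and `PromptTreeRouter`, the methods `record`, `match_lengths`, and `route`, the instance names, and the tie-breaking rule (lowest request count) are all hypothetical and are not MemServe's actual API or scheduling algorithm. The idea shown is only the general one: keep a tree of token prefixes annotated with the instances that hold the matching KV cache, and send each new request to the instance with the longest cached prefix.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # token id -> child node
    children: dict = field(default_factory=dict)
    # names of instances assumed to cache the prefix ending at this node
    instances: set = field(default_factory=set)


class PromptTreeRouter:
    """Hypothetical locality-aware router; NOT MemServe's actual scheduler."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.root = TrieNode()
        # number of requests routed to each instance (illustrative load metric)
        self.load = {name: 0 for name in self.instances}

    def record(self, tokens, instance):
        """Mark every prefix of `tokens` as cached on `instance`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.instances.add(instance)

    def match_lengths(self, tokens):
        """Return, per instance, the length of the longest cached prefix."""
        matched = {name: 0 for name in self.instances}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for name in node.instances:
                matched[name] = depth
        return matched

    def route(self, tokens):
        """Pick the instance with the longest cached prefix, then lowest load."""
        matched = self.match_lengths(tokens)
        best = min(self.instances, key=lambda n: (-matched[n], self.load[n]))
        self.load[best] += 1
        self.record(tokens, best)
        return best, matched[best]


if __name__ == "__main__":
    router = PromptTreeRouter(["inst-a", "inst-b"])
    shared_prefix = [1, 2, 3, 4]  # e.g. a tokenized system prompt
    print(router.route(shared_prefix + [10, 11]))
    print(router.route(shared_prefix + [20]))
    print(router.route([7, 8, 9]))
```

In this sketch, a request that shares a prefix with an earlier one is routed to the instance that served the earlier request, so the cached prefix can be reused; a request with no cached prefix falls back to the least-loaded instance. A real scheduler would also need to account for cache eviction and for the separate prefill and decode instances used in disaggregated inference, which this sketch omits.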

Taxonomy

Topics: Energy Efficient Wireless Sensor Networks · Caching and Content Delivery · Context-Aware Activity Recognition Systems