MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

TL;DR
MemServe is a novel system that optimizes large language model serving by integrating context caching with disaggregated inference through an elastic memory pool, improving efficiency and response times.
Contribution
It introduces MemPool, an elastic memory pool, and a global scheduler that together enable combined inter-request and intra-request caching for LLM serving.
Findings
Significant reduction in job completion time.
Improved time-to-first-token (TTFT).
Enhanced cache reuse through a global prompt tree-based, locality-aware scheduling policy.
Abstract
Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.
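To make the idea of a global prompt tree-based, locality-aware policy more concrete, here is a minimal Python sketch. It is not MemServe's implementation: the names `PromptTree`, `record`, `match_lengths`, and `route` are hypothetical, the tree is keyed on individual token ids for simplicity, and the load-based tie-break is an assumption added for illustration. The sketch only shows the general technique of routing a request to the serving instance that already caches the longest prefix of its prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical trie node: children keyed by token id.
    children: dict = field(default_factory=dict)
    # Instances assumed to hold KV cache for the prefix ending at this node.
    instances: set = field(default_factory=set)


class PromptTree:
    """Illustrative global prefix tree over prompts (an assumption, not the paper's code)."""

    def __init__(self):
        self.root = Node()

    def record(self, tokens, instance):
        """Note that `instance` now caches every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
            node.instances.add(instance)

    def match_lengths(self, tokens):
        """Return {instance: length of the longest cached prefix of `tokens`}."""
        best = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for inst in node.instances:
                best[inst] = depth
        return best


def route(tree, tokens, load):
    """Pick the instance with the longest cached prefix; break ties by lower load.

    `load` maps instance name -> current load (an assumed, simplified metric).
    """
    matched = tree.match_lengths(tokens)
    return max(load, key=lambda inst: (matched.get(inst, 0), -load[inst]))


tree = PromptTree()
tree.record([1, 2, 3, 4], "A")
tree.record([1, 2, 9], "B")
load = {"A": 5, "B": 1, "C": 0}

choice_long_prefix = route(tree, [1, 2, 3, 7], load)  # A caches 3 tokens, B caches 2
choice_tie = route(tree, [1, 2, 5], load)             # A and B both cache 2 tokens
choice_no_match = route(tree, [8, 8], load)           # nothing cached anywhere
```

In this sketch, a request is sent to the instance with the most reusable cache, and load only matters when cached prefix lengths are equal. A production scheduler would likely track prefixes at a coarser granularity than single tokens and weigh locality against load more carefully.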
Taxonomy
Topics: Energy Efficient Wireless Sensor Networks · Caching and Content Delivery · Context-Aware Activity Recognition Systems
