eLLM: Elastic Memory Management Framework for Efficient LLM Serving

Jiale Xu; Rui Zhang; Yi Xiong; Cong Guo; Zihan Liu; Yangjie Zhou; Weiming Hu; Hao Wu; Changxu Shao; Ziqing Wang; Yongjie Yuan; Junping Zhao; Minyi Guo; Jingwen Leng

arXiv:2506.15155·cs.DC·May 8, 2026

TL;DR

eLLM introduces an elastic memory management framework for large language model (LLM) serving that dynamically adjusts GPU and CPU memory usage, improving throughput and supported batch sizes under strict service-level objectives (SLOs).

Contribution

The paper presents eLLM, a novel elastic memory management system that unifies tensor and cache management, enabling dynamic memory adjustment and improved performance for LLM serving.

Findings

01. eLLM achieves 2.32x higher decoding throughput.

02. Supports 3x larger batch sizes for 128K-token inputs.

03. Outperforms state-of-the-art systems significantly.

Abstract

Large Language Models are increasingly being deployed in datacenters. Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value caches. While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently during runtime, presenting significant challenges for efficient memory management. Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches utilize a page table-based virtualization layer built on top of the tensor abstraction. This virtualization dynamically manages KV caches to mitigate memory fragmentation. However, this dual-level approach fundamentally isolates runtime memory and KV cache management, resulting in…
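
To make the "page table-based virtualization layer" mentioned in the abstract concrete, here is a minimal sketch of the general idea: each request's logical KV cache is mapped onto fixed-size physical blocks through a per-request block table, so memory is claimed one block at a time instead of being reserved as one large contiguous tensor. This is not eLLM's implementation, and it does not attempt to show the unified tensor and cache management the paper proposes. The class `PagedKVCache`, its methods, and the block sizes are hypothetical names chosen for illustration.

```python
class PagedKVCache:
    """Toy page-table allocator for KV cache blocks.

    Hypothetical illustration only; not code from eLLM or any serving system.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        # Pool of free physical block ids.
        self.free_blocks = list(range(num_blocks))
        # Page table: request id -> ordered list of physical block ids.
        self.block_tables: dict[int, list[int]] = {}
        # Number of tokens currently stored per request.
        self.lengths: dict[int, int] = {}

    def append_token(self, req_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        # A new physical block is needed when the last one is full
        # (or when the request has no block yet).
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, req_id: int) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token(req_id=0) for _ in range(3)]
print(len(cache.block_tables[0]))  # three tokens occupy two blocks of size 2
cache.free(0)
```

In this sketch the block pool has a fixed size set at construction time and is separate from whatever memory holds activations, which mirrors the separation between KV cache management and runtime memory management that the abstract describes.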
