Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yicong Zhu, Yuqi Zhou, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, Qiang Liu

TL;DR
Beluga introduces a CXL-based memory architecture that allows GPUs and CPUs to share large-scale memory pools with near-local latency, significantly improving LLM KVCache management efficiency.
Contribution
Beluga is presented as the first system that lets GPUs directly access large-scale memory pools through CXL switches, reducing the latency and complexity of KVCache management.
Findings
Achieves an 89.6% reduction in Time-To-First-Token (TTFT) compared to RDMA-based solutions
Delivers a 7.35× throughput improvement in vLLM inference against the same RDMA-based baseline
Provides design guidelines for CXL-based memory pools
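To put the TTFT figure on the same scale as the throughput figure, a percentage reduction can be converted into a speedup factor. The sketch below is plain arithmetic on the reported 89.6% number; the helper name `reduction_to_speedup` is hypothetical and not taken from the paper.

```python
def reduction_to_speedup(reduction_pct: float) -> float:
    """Convert a percentage reduction in latency into a speedup factor.

    A reduction of r% leaves (100 - r)% of the baseline latency,
    so the speedup is baseline / remaining = 1 / (1 - r/100).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction


# Reported TTFT reduction from the findings above.
ttft_speedup = reduction_to_speedup(89.6)
print(f"TTFT speedup: {ttft_speedup:.2f}x")
```

Under this reading, an 89.6% reduction means the first token arrives in roughly one tenth of the baseline time, which is a somewhat larger factor than the 7.35× throughput gain.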
Abstract
The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale…
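The capacity pressure described in the abstract can be illustrated with the standard KVCache size estimate for a decoder-only transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is only an illustration; the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit elements) and the workload (8 concurrent sequences of 32,768 tokens) are assumed values, not figures from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KVCache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Assumed (hypothetical) model shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(f"per token: {per_token / 1024:.0f} KiB")

# Assumed workload: 8 concurrent sequences of 32,768 tokens each.
total = kv_cache_bytes(32, 32, 128, seq_len=32768, batch_size=8)
print(f"total: {total / GIB:.0f} GiB")
```

With these assumed numbers the cache alone comes to 128 GiB, which is more than the HBM on a single current GPU typically provides, so the working set has to spill into host memory or a pooled memory tier.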
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Graph Theory and Algorithms
