Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

Dowon Kim; MinJae Lee; Janghyeon Kim; HyuckSung Kwon; Hyeonggyu Jeong; Sang-Soo Park; Minyong Yoon; Si-Dong Roh; Yongsuk Kwon; Jinin So; Jungwook Choi

arXiv:2511.00321·cs.AR·November 4, 2025

TL;DR

This paper introduces a CXL-enabled processing-near-memory system for managing large KV-caches in 1M-token LLM inference, significantly improving throughput, energy efficiency, and scalability beyond GPU limits.
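For a sense of scale, the size of a 1M-token KV-cache can be estimated with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration, not one taken from the paper, and the helper name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold the keys and values of one sequence."""
    # The factor 2 covers one key vector plus one value vector per head and layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
total = kv_cache_bytes(1_000_000, num_layers=32, num_kv_heads=8, head_dim=128)

print(per_token)    # 131072 bytes, i.e. 128 KiB per token
print(total / 1e9)  # 131.072 (GB) for a single 1M-token sequence
```

Under these assumptions a single sequence already needs about 131 GB for its KV-cache, before model weights and activations are counted, and the figure scales linearly with batch size.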

Contribution

It proposes a novel CXL-based KV-cache management system with a PNM accelerator, hybrid parallelization, and steady-token selection, enabling scalable long-context LLM inference.

Findings

01. Up to 21.9x throughput improvement

02. Up to 60x lower energy per token

03. Up to 7.3x better total cost efficiency

Abstract

The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to…
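The idea of selecting token pages next to the memory that holds them can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: it assumes each page is summarized by the mean of its key vectors and that pages are ranked by the dot product between the query and that summary. The function names `page_summaries` and `select_pages` are hypothetical, and the steady-token mechanism is not modeled because this excerpt does not describe it.

```python
import heapq


def page_summaries(keys, page_size):
    """Summarize each page of key vectors by its element-wise mean (an assumption)."""
    summaries = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        dim = len(page[0])
        summaries.append([sum(k[d] for k in page) / len(page) for d in range(dim)])
    return summaries


def select_pages(query, summaries, top_k):
    """Return the indices of the top_k pages with the highest query-summary dot product."""
    scores = [sum(q * s for q, s in zip(query, summ)) for summ in summaries]
    best = heapq.nlargest(top_k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


# Six 2-dimensional keys grouped into three pages of two tokens each.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0]]
summaries = page_summaries(keys, page_size=2)
selected = select_pages([1.0, 0.5], summaries, top_k=2)
print(selected)  # [0, 1]
```

In a near-memory design, scoring of this kind would run where the keys are stored, so the full key set does not have to be moved to the GPU just to decide which pages matter.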

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy