PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference

Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, and Reetuparna Das

arXiv:2502.07578 · cs.AR · May 6, 2025

TL;DR

This paper introduces CENT, a CXL-enabled, GPU-free system for large language model inference that combines CXL memory expansion with high-bandwidth memory access to improve throughput, energy efficiency, and cost-effectiveness over GPU baselines.

Contribution

CENT is the first system to utilize CXL memory expansion and peer-to-peer communication for efficient GPU-free LLM inference, outperforming GPU-based systems in throughput and energy efficiency.

Findings

01. 2.3× higher throughput than GPU baselines
02. 2.9× less energy consumption
03. 5.2× more tokens per dollar

Abstract

Large Language Model (LLM) inference generates one token at a time in an autoregressive manner and therefore exhibits notably lower operational intensity than earlier Machine Learning (ML) models such as encoder-only transformers and Convolutional Neural Networks. At the same time, LLMs possess large parameter sizes and use key-value caches to store context information. Modern LLMs support context windows with up to 1 million tokens to generate versatile text, audio, and video content. A large key-value cache unique to each prompt requires a large memory capacity, limiting the inference batch size. Both low operational intensity and limited batch size necessitate a high memory bandwidth. However, contemporary hardware systems for ML model deployment, such as GPUs and TPUs, are primarily optimized for compute throughput. This mismatch challenges the efficient deployment of advanced…
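The abstract's argument (a per-prompt key-value cache limits batch size, and a small batch keeps operational intensity low) can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function names, the model shape (80 layers, 8 key-value heads, head dimension 128, FP16), the 32,768-token context, and the 320 GB memory budget are all hypothetical values chosen for illustration. The intensity estimate counts only weight traffic in dense layers and ignores activations and attention.

```python
# Illustrative arithmetic only; every number below is an assumption, not a result from the paper.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Key-value cache size in bytes for ONE sequence.

    Each token stores one key and one value vector (the factor 2) per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def max_batch_size(memory_bytes, weight_bytes, kv_bytes_per_seq):
    """Largest batch whose KV caches fit in the memory left after storing the weights."""
    return max(0, (memory_bytes - weight_bytes) // kv_bytes_per_seq)


def decode_operational_intensity(batch_size, bytes_per_elem=2):
    """Approximate FLOPs per byte of weight traffic in one decoding step.

    Each weight is read once per step and used for one multiply-add (2 FLOPs)
    per sequence in the batch, so intensity grows linearly with batch size.
    """
    return 2 * batch_size / bytes_per_elem


# Hypothetical 70-billion-parameter model in FP16 with a 32,768-token context.
kv_per_seq = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_len=32768)
weights = 70_000_000_000 * 2      # 2 bytes per FP16 parameter
memory = 320_000_000_000          # assumed total device memory in bytes

batch = max_batch_size(memory, weights, kv_per_seq)
print(f"KV cache per sequence: {kv_per_seq / 2**30:.1f} GiB")
print(f"Max batch size: {batch}")
print(f"Decode intensity: {decode_operational_intensity(batch):.1f} FLOPs/byte")
```

Under these assumed numbers, each sequence needs 10 GiB of key-value cache, so only 16 sequences fit alongside the weights, and each byte of weights read supports roughly 16 FLOPs. Longer contexts shrink the feasible batch further, which is the capacity and bandwidth pressure the abstract describes.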

Code & Models

Repositories

Yufeng98/CENT (PyTorch, official)
