Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM

Lian Liu; Shixin Zhao; Bing Li; Haimeng Ren; Zhaohui Xu; Mengdi Wang; Xiaowei Li; Yinhe Han; Ying Wang

arXiv:2502.16963·cs.AR·February 25, 2025

TL;DR

Hermes is a cost-effective system that enhances LLM inference on consumer hardware by combining GPU and NDP-DIMM memory, leveraging activation sparsity for significant speedups.

Contribution

This work introduces Hermes, a novel heterogeneous computing system utilizing NDP-DIMMs to improve LLM inference efficiency on budget hardware.

Findings

01. Achieves 13.75 tokens/sec for LLaMA2-70B on consumer hardware.

02. Realizes an average 75.24× speedup over existing offloading systems.

03. Effectively manages real-time neuron partitioning and load balancing.

Abstract

Billion-scale Large Language Models (LLMs) need to be deployed on expensive server-grade GPUs with large-capacity HBM and abundant compute capability. As LLM-assisted services become popular, cost-effective LLM inference on budget-friendly hardware is becoming a trend. Extensive research relocates LLM parameters from expensive GPUs to host memory. However, the restricted bandwidth between host and GPU memory limits inference performance. This work introduces Hermes, a budget-friendly system that leverages near-data processing (NDP) within commodity DRAM DIMMs to enhance the performance of a single consumer-grade GPU, achieving efficient LLM inference. The inherent activation sparsity in LLMs naturally divides weight parameters into two categories, termed "hot" and "cold" neurons, respectively. Hot neurons, which consist of only approximately 20% of all…
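To make the hot/cold split concrete, here is a minimal Python sketch that ranks neurons by how often they were activated and labels the top fraction as hot. It assumes that "hot" means most frequently activated and uses the roughly 20% figure from the abstract as a default; the function name `partition_neurons`, the `activation_counts` profiling input, and the ranking rule are hypothetical illustrations, not the partitioner Hermes actually implements.

```python
from typing import List, Tuple


def partition_neurons(
    activation_counts: List[int], hot_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Split neuron indices into (hot, cold) by activation frequency.

    Assumption: activation_counts[i] is how often neuron i fired during
    some profiling run. The 0.2 default mirrors the ~20% figure quoted in
    the abstract; the real system may use a different criterion.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")

    n = len(activation_counts)
    num_hot = round(n * hot_fraction)

    # Rank neuron indices from most to least frequently activated.
    ranked = sorted(range(n), key=lambda i: activation_counts[i], reverse=True)

    # Return each group in ascending index order for readability.
    hot = sorted(ranked[:num_hot])
    cold = sorted(ranked[num_hot:])
    return hot, cold


# Hypothetical profiling counts for ten neurons.
counts = [5, 90, 3, 80, 1, 2, 4, 7, 6, 0]
hot, cold = partition_neurons(counts)
print("hot:", hot)
print("cold:", cold)
```

A static ranking like this only captures the idea of the split; the findings listed above mention real-time partitioning, which a one-shot profile does not model.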

Taxonomy

Topics: Advanced Neural Network Applications