Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
Lian Liu, Shixin Zhao, Bing Li, Haimeng Ren, Zhaohui Xu, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang

TL;DR
Hermes is a cost-effective system that enhances LLM inference on consumer hardware by combining GPU and NDP-DIMM memory, leveraging activation sparsity for significant speedups.
Contribution
This work introduces Hermes, a novel heterogeneous computing system utilizing NDP-DIMMs to improve LLM inference efficiency on budget hardware.
Findings
Achieves 13.75 tokens/sec for LLaMA2-70B on consumer hardware.
Realizes an average 75.24× speedup over existing offloading systems.
Effectively manages real-time neuron partitioning and load balancing.
Abstract
Billion-scale Large Language Models (LLMs) typically require deployment on expensive server-grade GPUs with large-capacity HBM and abundant compute capability. As LLM-assisted services become popular, cost-effective LLM inference on budget-friendly hardware is becoming the trend. Extensive research relocates LLM parameters from expensive GPUs to host memory. However, the restricted bandwidth between host and GPU memory limits inference performance. This work introduces Hermes, a budget-friendly system that leverages near-data processing (NDP) within commodity DRAM DIMMs to enhance the performance of a single consumer-grade GPU, achieving efficient LLM inference. The inherent activation sparsity in LLMs naturally divides weight parameters into two categories, termed "hot" and "cold" neurons, respectively. Hot neurons, which consist of only approximately 20% of all…
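The hot/cold split described in the abstract can be illustrated with a minimal sketch. It assumes per-neuron activation counts are already available from profiling, and it labels the most frequently activated 20% of neurons as hot. The function name `partition_neurons`, the `hot_fraction` parameter, and the sample counts are hypothetical and not taken from the paper; the sketch shows only the split itself, not how Hermes maps each group to hardware or rebalances at run time.

```python
def partition_neurons(activation_counts, hot_fraction=0.2):
    """Split neuron indices into hot and cold lists by activation frequency.

    activation_counts: how often each neuron fired during profiling
                       (hypothetical input, not data from the paper).
    hot_fraction:      share of neurons labelled hot (assumed 20%).
    """
    n = len(activation_counts)
    n_hot = max(1, round(n * hot_fraction)) if n else 0
    # Rank neuron indices from most to least frequently activated.
    ranked = sorted(range(n), key=lambda i: activation_counts[i], reverse=True)
    hot = sorted(ranked[:n_hot])
    cold = sorted(ranked[n_hot:])
    return hot, cold


# Illustrative counts for a 10-neuron layer (made up for this example).
counts = [90, 3, 1, 75, 2, 0, 4, 1, 2, 5]
hot, cold = partition_neurons(counts)

# Fraction of all observed activations that fall on the hot neurons.
hot_share = sum(counts[i] for i in hot) / sum(counts)
print(hot, cold, round(hot_share, 3))
```

With skewed counts like these, a small hot set accounts for most activations, which is the property an activation-sparsity-based partitioning scheme relies on.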
Taxonomy
Topics: Advanced Neural Network Applications
