Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

Jie Peng; Zhang Cao; Huaizhi Qu; Zhengyu Zhang; Chang Guo; Yanyong Zhang; Zhichao Cao; Tianlong Chen

arXiv:2410.14740·cs.LG·October 24, 2024

TL;DR

This paper presents M2Cache, a multi-level caching and mixed-precision approach that enables large language model inference on resource-constrained, older GPUs, significantly reducing carbon emissions.

Contribution

It introduces a novel modularization, importance ranking, and multi-level caching system combined with dynamic sparse mixed-precision quantization for sustainable LLM inference on outdated hardware.

Findings

1. Reduces carbon emissions by enabling LLM inference on older GPUs.
2. Achieves efficient LLM serving with multi-level cache management.
3. Demonstrates the feasibility of low-resource LLM inference with significant performance gains.

Abstract

Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and the extensive computation they require make LLM deployment the main source of carbon emissions in today's AI applications. Compared to modern GPUs like the H100, it would be significantly more carbon-sustainable to leverage older GPUs such as the M40 (as shown in Figure 1, the M40 has only one third of the H100's carbon emissions) for LLM serving. However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot hold LLMs, given the gigantic model size and intermediate activation data, which makes serving them challenging. For instance, a LLaMA2 model with 70B parameters typically requires 128 GB for inference, which substantially surpasses the 24 GB of HBM in a 3090 GPU and remains infeasible even with an additional 64 GB of DRAM. To…
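The memory gap described in the abstract can be illustrated with a back-of-envelope calculation. The sketch below is not taken from the paper: the function names, the candidate bit widths, and the three-tier placement rule are assumptions made for illustration, and it counts weights only, ignoring activations and KV cache. Only the 70B parameter count and the 24 GB and 64 GB capacities come from the abstract.

```python
# Back-of-envelope estimate of weight memory at different precisions.
# Hypothetical helper names; counts weights only (no activations or KV cache).

HBM_GB = 24        # GPU memory of the 3090, as stated in the abstract
DRAM_GB = 64       # additional host DRAM, as stated in the abstract
NUM_PARAMS = 70e9  # LLaMA2-70B parameter count


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def placement(num_params: float, bits_per_param: int) -> str:
    """Smallest memory tier combination that can hold all weights (assumed rule)."""
    need = weight_footprint_gb(num_params, bits_per_param)
    if need <= HBM_GB:
        return "fits in HBM"
    if need <= HBM_GB + DRAM_GB:
        return "needs HBM + DRAM"
    return "needs HBM + DRAM + SSD"


if __name__ == "__main__":
    for bits in (16, 8, 4, 2):
        gb = weight_footprint_gb(NUM_PARAMS, bits)
        print(f"{bits:>2}-bit: {gb:6.1f} GB -> {placement(NUM_PARAMS, bits)}")
```

Under these assumptions, 16-bit weights alone exceed the combined 88 GB of HBM and DRAM, which is consistent with the abstract's point that a further storage tier or lower precision is needed.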

Taxonomy

Topics: Digital Rights Management and Security · Advanced Data Storage Technologies

Methods: Concatenated Skip Connection · 1x1 Convolution · Max Pooling · U-Net · Self-Supervised Deep Supervision · Non Maximum Suppression · Convolution · SSD