Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen

TL;DR
This paper presents M2Cache, a multi-level caching and mixed-precision approach that enables large language model inference on resource-constrained, older GPUs, significantly reducing carbon emissions.
Contribution
It introduces a novel modularization, importance ranking, and multi-level caching system combined with dynamic sparse mixed-precision quantization for sustainable LLM inference on outdated hardware.
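As a rough illustration of the multi-level caching idea only, the sketch below models a three-tier lookup in which a small fast tier stands in for HBM, a larger middle tier for DRAM, and a backing store for SSD. It is not the paper's implementation, which this summary does not describe in detail. The class name `TieredCache`, the least-recently-used eviction policy, the capacities, and the block names are all assumptions made for the example.

```python
from collections import OrderedDict


class TieredCache:
    """Toy three-tier cache (hypothetical): fast tier, middle tier, backing store."""

    def __init__(self, hbm_capacity, dram_capacity, ssd_store):
        self.hbm = OrderedDict()    # smallest, fastest tier; oldest entry first
        self.dram = OrderedDict()   # larger middle tier; oldest entry first
        self.ssd = ssd_store        # backing store that always holds every block
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity

    def get(self, key):
        """Return (value, name of the tier that served the request)."""
        if key in self.hbm:
            self.hbm.move_to_end(key)       # mark as most recently used
            return self.hbm[key], "hbm"
        if key in self.dram:
            value = self.dram.pop(key)      # promote from the middle tier
            source = "dram"
        else:
            value = self.ssd[key]           # fall back to the backing store
            source = "ssd"
        self._put_hbm(key, value)
        return value, source

    def _put_hbm(self, key, value):
        self.hbm[key] = value
        if len(self.hbm) > self.hbm_capacity:
            # Demote the least recently used block to the middle tier.
            old_key, old_value = self.hbm.popitem(last=False)
            self._put_dram(old_key, old_value)

    def _put_dram(self, key, value):
        self.dram[key] = value
        if len(self.dram) > self.dram_capacity:
            # Safe to drop: the backing store still holds a copy.
            self.dram.popitem(last=False)


# Hypothetical blocks of model weights, keyed by name.
ssd = {f"block{i}": i for i in range(4)}
cache = TieredCache(hbm_capacity=1, dram_capacity=2, ssd_store=ssd)
trace = [cache.get(k)[1] for k in ["block0", "block1", "block0", "block0"]]
print(trace)
```

In this toy trace the first two requests are served from the backing store, the third from the middle tier after `block0` was demoted, and the fourth from the fast tier. A real system would also have to account for transfer bandwidth between tiers and for the precision at which each block is stored, which this sketch ignores.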
Findings
Reduces carbon emissions by enabling LLM inference on older GPUs.
Achieves efficient LLM serving with multi-level cache management.
Demonstrates feasibility of low-resource LLM inference with significant performance gains.
Abstract
Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and the extensive computation they require make LLM deployment the main source of carbon emissions from today's AI applications. Compared to modern GPUs like H, it would be significantly more carbon-sustainable to leverage older GPUs such as M (as shown in Figure 1, M has only one third of H's carbon emissions) for LLM serving. However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot hold an LLM, given the gigantic model size and intermediate activation data, which makes serving challenging. For instance, a LLaMA2 model with B parameters typically requires GB for inference, which substantially surpasses the GB of HBM in a GPU and remains infeasible even with the additional GB of DRAM. To…
Taxonomy
Topics: Digital Rights Management and Security · Advanced Data Storage Technologies
Methods: Concatenated Skip Connection · 1x1 Convolution · Max Pooling · U-Net · Self-Supervised Deep Supervision · Non Maximum Suppression · Convolution · SSD
