SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
Kwanhee Kyung, Sungmin Yun, Jung Ho Ahn

TL;DR
This paper shows that offloading Mixture-of-Experts (MoE) expert weights to SSDs during LLM inference substantially increases energy consumption per generated token, and argues that only future reductions in Flash read energy could make SSDs energy-efficient for this purpose.
Contribution
It provides a quantitative analysis of the energy cost of offloading MoE expert weights to SSDs during the decode stage, comparing SSD, CPU memory (DDR), and HBM storage, and highlights the fundamental energy inefficiency of SSDs relative to DRAM-based storage.
Findings
SSD offloading increases per-token generation energy by up to ~12x compared to the HBM baseline.
Prefetching hides SSD access latency but cannot offset the fundamental energy penalty of SSD access.
Future improvements in Flash read energy could make SSDs viable for MoE models.
Abstract
Large Language Models (LLMs) applying Mixture-of-Experts (MoE) scale to trillions of parameters but require vast memory, motivating a line of research to offload expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs. While SSDs provide cost-effective capacity, their read energy per bit is substantially higher than that of DRAM. This paper quantitatively analyzes the energy implications of offloading MoE expert weights to SSDs during the critical decode stage of LLM inference. Our analysis, comparing SSD, CPU memory (DDR), and HBM storage scenarios for models like DeepSeek-R1, reveals that offloading MoE weights to current SSDs drastically increases per-token-generation energy consumption (e.g., by up to ~12x compared to the HBM baseline), dominating the total inference energy budget. Although techniques like prefetching effectively hide access latency, they cannot mitigate…
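To make the abstract's comparison concrete, here is a minimal sketch of the underlying arithmetic: the energy spent reading weights for one decoded token is the number of bits read times the read energy per bit of the tier that holds them. It models weight reads only, not total inference energy. The function names and every numeric value below are hypothetical placeholders chosen for illustration; they are not figures from the paper.

```python
from typing import Dict


def read_energy_joules(bytes_read: float, pj_per_bit: float) -> float:
    """Energy in joules to read `bytes_read` bytes at `pj_per_bit` picojoules per bit."""
    return bytes_read * 8 * pj_per_bit * 1e-12


def energy_ratios(
    bytes_read: float,
    pj_per_bit_by_tier: Dict[str, float],
    baseline: str = "HBM",
) -> Dict[str, float]:
    """Per-tier weight-read energy, normalized to the baseline tier."""
    energies = {
        tier: read_energy_joules(bytes_read, pj)
        for tier, pj in pj_per_bit_by_tier.items()
    }
    return {tier: e / energies[baseline] for tier, e in energies.items()}


# ASSUMPTION: placeholder values for illustration only, not measurements
# and not taken from the paper.
PLACEHOLDER_PJ_PER_BIT = {"HBM": 1.0, "DDR": 3.0, "SSD": 10.0}
PLACEHOLDER_BYTES_PER_TOKEN = 1e9  # expert-weight bytes read per decoded token

ratios = energy_ratios(PLACEHOLDER_BYTES_PER_TOKEN, PLACEHOLDER_PJ_PER_BIT)
for tier, ratio in ratios.items():
    print(f"{tier}: {ratio:.1f}x the HBM read energy")
```

In this simplified model the result depends only on how many bits are read and from which tier, not on when they are read. That is why prefetching, which changes the timing of reads, can hide latency without reducing the energy term.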
