FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices
Byeongju Kim, Jungwan Lee, Donghyeon Han, Hoi-Jun Yoo, Sangyeob Kim

TL;DR
FlashMoE offloads inactive experts to SSD and manages the expert cache with a machine learning-based replacement policy, reducing SSD I/O bottlenecks and enabling large-scale MoE inference on memory-constrained edge devices.
Contribution
The paper presents FlashMoE, a system that offloads inactive experts to SSD and uses an ML-based cache replacement policy to manage the expert cache, addressing the limits of earlier DRAM-offloading systems when MoE models grow to hundreds of gigabytes.
Findings
Up to 51% improvement in cache hit rate over LRU and LFU.
Achieves up to 2.6x speedup in MoE inference.
Demonstrates practicality on real desktop hardware.
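For context on the LRU and LFU baselines named above, here is a minimal sketch of how their hit rates could be measured on a trace of expert accesses. It is not FlashMoE's ML-based policy, which this summary does not describe. The function names, the skewed synthetic trace, and the cache capacity are all illustrative assumptions, not values from the paper.

```python
from collections import Counter, OrderedDict
import random


def lru_hit_rate(trace, capacity):
    """Fraction of accesses served by an LRU cache holding `capacity` experts."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = True
    return hits / len(trace)


def lfu_hit_rate(trace, capacity):
    """Fraction of accesses served by an LFU cache holding `capacity` experts."""
    cache = set()
    freq = Counter()  # access counts, kept even after an expert is evicted
    hits = 0
    for expert in trace:
        freq[expert] += 1
        if expert in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                # Evict the cached expert with the lowest count (ties: lowest id).
                victim = min(cache, key=lambda e: (freq[e], e))
                cache.remove(victim)
            cache.add(expert)
    return hits / len(trace)


def synthetic_trace(num_experts=64, length=5000, seed=0):
    """Hypothetical skewed access trace; real MoE routing traces will differ."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(num_experts)]
    return rng.choices(range(num_experts), weights=weights, k=length)


if __name__ == "__main__":
    trace = synthetic_trace()
    print("LRU hit rate:", lru_hit_rate(trace, capacity=16))
    print("LFU hit rate:", lfu_hit_rate(trace, capacity=16))
```

Every miss in this model stands for an expert that would have to be read from SSD, which is why a higher hit rate translates into less I/O. A learned policy would replace only the eviction decision; the hit-rate accounting stays the same.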
Abstract
Recently, Mixture-of-Experts (MoE) models have gained attention for efficiently scaling large language models. Although these models are extremely large, their sparse activation enables inference to be performed by accessing only a fraction of the model at a time. This property opens the possibility of on-device inference of MoE, which was previously considered infeasible for such large models. Consequently, various systems have been proposed to leverage this sparsity and enable efficient MoE inference for edge devices. However, previous MoE inference systems such as Fiddler [8] and DAOP [13] rely on DRAM-based offloading and are not suitable for memory-constrained on-device environments. As recent MoE models grow to hundreds of gigabytes, RAM-offloading solutions become impractical. To address this, we propose FlashMoE, a system that offloads inactive experts to SSD, enabling efficient MoE…
Taxonomy
Topics: Advanced Data Storage Technologies · Big Data and Digital Economy · Caching and Content Delivery
