FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices

Byeongju Kim; Jungwan Lee; Donghyeon Han; Hoi-Jun Yoo; Sangyeob Kim

arXiv:2601.17063·cs.LG·January 27, 2026

Open Access

TL;DR

FlashMoE introduces a machine learning-based cache management system that offloads inactive experts to SSD, significantly reducing I/O bottlenecks and enabling efficient large-scale MoE inference on memory-constrained edge devices.

Contribution

The paper presents FlashMoE, a novel system that uses ML-driven caching to offload experts to SSD, addressing the limitations of previous RAM-based solutions for large MoE models.

Findings

01. Up to 51% improvement in cache hit rate over LRU and LFU.

02. Achieves up to 2.6x speedup in MoE inference.

03. Demonstrates practicality on real desktop hardware.
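
The LRU and LFU baselines named in the first finding can be illustrated with a small simulation. The sketch below is not the FlashMoE policy, which this page does not describe; it only measures the hit rate of the two classic baselines on a synthetic expert-access trace. The function names, the skewed trace, and the cache capacity are all illustrative assumptions, not values from the paper.

```python
from collections import Counter, OrderedDict
import random


def lru_hit_rate(trace, capacity):
    """Fraction of accesses served by an LRU cache holding `capacity` experts."""
    cache = OrderedDict()
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = True
    return hits / len(trace)


def lfu_hit_rate(trace, capacity):
    """Fraction of accesses served by a simple LFU cache.

    Assumption: frequencies count every access seen so far (even while the
    expert is not cached), and ties are broken by the smaller expert id.
    """
    cache = set()
    freq = Counter()
    hits = 0
    for expert in trace:
        freq[expert] += 1
        if expert in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                victim = min(cache, key=lambda e: (freq[e], e))
                cache.remove(victim)       # evict the least frequently used expert
            cache.add(expert)
    return hits / len(trace)


# Hypothetical workload: 64 experts with a skewed popularity distribution.
rng = random.Random(0)
experts = list(range(64))
weights = [1.0 / (rank + 1) for rank in range(64)]
trace = rng.choices(experts, weights=weights, k=10_000)

print("LRU hit rate:", lru_hit_rate(trace, capacity=8))
print("LFU hit rate:", lfu_hit_rate(trace, capacity=8))
```

A learned replacement policy would be evaluated the same way: replay the same trace through it and compare the resulting hit rate against these two numbers.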

Abstract

Recently, Mixture-of-Experts (MoE) models have gained attention for efficiently scaling large language models. Although these models are extremely large, their sparse activation allows inference to proceed by accessing only a fraction of the model at a time. This property opens the possibility of on-device MoE inference, which was previously considered infeasible for such large models. Consequently, various systems have been proposed to leverage this sparsity and enable efficient MoE inference on edge devices. However, previous MoE inference systems such as Fiddler [8] and DAOP [13] rely on DRAM-based offloading and are not suitable for memory-constrained on-device environments. As recent MoE models grow to hundreds of gigabytes, RAM-offloading solutions become impractical. To address this, we propose FlashMoE, a system that offloads inactive experts to SSD, enabling efficient MoE…

Taxonomy

Topics: Advanced Data Storage Technologies · Big Data and Digital Economy · Caching and Content Delivery