SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution

Guoying Zhu; Meng Li; Haipeng Dai; Xuechen Liu; Weijun Wang; Keran Li; Jun Xiao; Ligeng Chen; Wei Wang

arXiv:2508.18983·cs.AI·May 5, 2026

TL;DR

This paper presents SMoE, an algorithm-system co-design that reduces memory use and decoding latency for Mixture of Experts (MoE) models on edge devices through expert substitution and reuse-aware scheduling of GPU-cached experts.

Contribution

The paper introduces importance-guided expert substitution and a reuse-maximizing scheduling policy to improve MoE deployment on resource-constrained edge hardware.

Findings

01. 48% lower decoding latency compared to baseline
02. Over 60% expert cache hit rate achieved
03. Maintains nearly lossless accuracy

Abstract

The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache…
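The substitution idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-expert `importance` scores, the pairwise `similarity` table, and both thresholds are assumptions made for illustration, since this summary does not say how SMoE measures importance or functional similarity. The sketch only shows the control flow: an activated expert that is missing from the GPU cache and has low importance is swapped for the most similar cached expert, so no transfer over PCIe is needed for it.

```python
from typing import Dict, List, Set


def substitute_experts(
    activated: List[int],
    importance: Dict[int, float],
    similarity: Dict[int, Dict[int, float]],
    cached: Set[int],
    importance_threshold: float = 0.2,   # hypothetical cutoff
    similarity_threshold: float = 0.8,   # hypothetical cutoff
) -> List[int]:
    """Return the experts to execute for one token at one MoE layer.

    Cached experts and high-importance experts are kept as routed.
    A low-importance expert that is not cached is replaced by the most
    similar cached expert, if one is similar enough; otherwise it is
    kept and would have to be loaded from host memory.
    """
    resolved: List[int] = []
    for expert in activated:
        if expert in cached or importance.get(expert, 1.0) >= importance_threshold:
            resolved.append(expert)
            continue
        # Candidate substitutes: cached experts not already in use for this token.
        candidates = [
            c for c in sorted(cached) if c not in activated and c not in resolved
        ]
        scores = similarity.get(expert, {})
        best = max(candidates, key=lambda c: scores.get(c, 0.0), default=None)
        if best is not None and scores.get(best, 0.0) >= similarity_threshold:
            resolved.append(best)    # substitute: served from the GPU cache
        else:
            resolved.append(expert)  # no good substitute: must be fetched
    return resolved


def cache_hit_rate(experts: List[int], cached: Set[int]) -> float:
    """Fraction of the given experts that are already resident in the cache."""
    if not experts:
        return 0.0
    return sum(e in cached for e in experts) / len(experts)


# Toy example: the router picks experts 3 and 7; only 1, 3 and 5 are cached.
activated = [3, 7]
cached = {1, 3, 5}
importance = {3: 0.9, 7: 0.1}
similarity = {7: {1: 0.4, 5: 0.92}}

resolved = substitute_experts(activated, importance, similarity, cached)
print(resolved)                             # [3, 5]
print(cache_hit_rate(activated, cached))    # 0.5 without substitution
print(cache_hit_rate(resolved, cached))     # 1.0 with substitution
```

In this toy case expert 7 has low importance and a close cached neighbor (expert 5), so it is substituted and the hit rate for the token rises from 0.5 to 1.0. A high-importance expert is never substituted in this sketch, which mirrors the abstract's claim that only low-importance experts are replaced in order to preserve accuracy.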
