MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye

TL;DR
MoE-SpAc is a memory management framework for Mixture-of-Experts (MoE) inference on edge devices. It uses speculative decoding as a lookahead sensor to anticipate expert activations, which guides prefetching and eviction and improves inference efficiency.
Contribution
The paper repurposes speculative decoding as a lookahead sensor for memory management in MoE models, combining expert demand estimation, dynamic workload balancing, and asynchronous execution.
Findings
Achieves a 42% improvement in tokens per second (TPS) over the state-of-the-art SD-based baseline.
Realizes an average 4.04x speedup over standard baselines.
Demonstrates these gains across seven benchmarks.
Abstract
Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement…
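The idea of handling prefetching and eviction in the same utility space can be illustrated with a small toy sketch. This is not the authors' implementation: the class name `SpeculativeUtilityCache`, the decayed-count utility, and the `capacity` and `decay` parameters are all assumptions made for illustration, since the abstract does not specify how utility is computed. The sketch only shows how expert IDs predicted for draft tokens could update one score per expert, and how that single score could decide both what to load and what to drop.

```python
from collections import defaultdict


class SpeculativeUtilityCache:
    """Toy expert cache: one utility score per expert drives prefetch and eviction.

    Hypothetical illustration only; the scoring rule is an assumption.
    """

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity          # experts that fit in fast memory (assumed)
        self.decay = decay                # assumed decay applied to older demand
        self.utility = defaultdict(float)
        self.resident = set()             # experts currently held in fast memory

    def observe_draft(self, draft_expert_ids):
        """Update utilities from the experts each draft token is predicted to use."""
        # Age out demand seen in earlier speculation rounds.
        for expert in list(self.utility):
            self.utility[expert] *= self.decay
        # Each predicted activation adds one unit of demand.
        for token_experts in draft_expert_ids:
            for expert in token_experts:
                self.utility[expert] += 1.0

    def plan(self):
        """Return (prefetch, evict) so the resident set is the top experts by utility."""
        # Sort by descending utility, breaking ties by expert ID for determinism.
        ranked = sorted(self.utility, key=lambda e: (-self.utility[e], e))
        target = set(ranked[: self.capacity])
        prefetch = sorted(target - self.resident)
        evict = sorted(self.resident - target)
        self.resident = target
        return prefetch, evict


cache = SpeculativeUtilityCache(capacity=2)
cache.observe_draft([[0, 1], [1, 2]])   # two draft tokens, two experts each
print(cache.plan())
```

Because both decisions read the same ranking, an expert is evicted only when another expert with higher estimated utility needs its slot. A real system would also have to overlap these transfers with computation, which this sketch leaves out.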
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Cloud Computing and Resource Management
