MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Wenfeng Wang, Jiacheng Liu, Xiaofeng Hou, Xinfeng Xia, Peng Tang, Mingxuan Zhang, Chao Li, Minyi Guo

TL;DR
MoE-SpeQ introduces a speculative execution and prefetching system for MoE models that significantly reduces I/O bottlenecks, enabling faster inference on memory-limited devices by overlapping computation with data transfer.
Contribution
The paper presents MoE-SpeQ, an inference system that co-designs speculative execution with expert offloading: a small on-device draft model predicts which experts future tokens will need, so those experts can be prefetched and the I/O latency hidden on memory-constrained hardware.
Findings
Achieves up to 2.34x speedup over existing offloading methods.
Demonstrates effective hiding of I/O latency through speculative prefetching.
Provides a principled approach for data-dependent memory management in resource-limited environments.
Abstract
The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to…
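The core idea in the abstract, predicting upcoming experts so their transfers overlap with computation, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions and not the authors' implementation: `ExpertCache`, `run_decode`, the string "weights", and the per-step routing sets are all hypothetical, the draft model is replaced by a precomputed list of predictions, and quantization and cache eviction are left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


class ExpertCache:
    """Toy stand-in for device memory backed by host memory (hypothetical)."""

    def __init__(self, host_experts: Dict[int, str]):
        self.host = host_experts          # experts offloaded to host memory
        self.device: Dict[int, str] = {}  # experts currently on the device
        self.sync_loads = 0               # transfers that hit the critical path

    def load(self, expert_id: int) -> None:
        # Simulates one host-to-device copy (no eviction in this sketch).
        self.device[expert_id] = self.host[expert_id]

    def get(self, expert_id: int) -> str:
        # A miss forces a synchronous transfer before compute can continue.
        if expert_id not in self.device:
            self.sync_loads += 1
            self.load(expert_id)
        return self.device[expert_id]


def run_decode(actual: List[Set[int]], predicted: List[Set[int]]) -> int:
    """Return the number of synchronous loads over all decoding steps.

    actual[t]    : experts the router really selects at step t
    predicted[t] : experts a draft model guessed for step t (assumed given)
    """
    all_ids = set().union(*actual, *predicted)
    cache = ExpertCache({i: f"weights-{i}" for i in all_ids})

    def prefetch(ids: Set[int]) -> None:
        for i in ids:
            cache.load(i)

    with ThreadPoolExecutor(max_workers=1) as pool:
        for step, needed in enumerate(actual):
            # Start fetching the predicted experts for the next step...
            future = None
            if step + 1 < len(predicted):
                future = pool.submit(prefetch, predicted[step + 1])
            # ...while the current step "computes" with its own experts.
            for expert_id in sorted(needed):
                cache.get(expert_id)
            # Prefetch must finish before the next step starts.
            if future is not None:
                future.result()
    return cache.sync_loads


actual_routes = [{0, 1}, {2, 3}, {4, 5}]
draft_guess = [{0, 1}, {2, 3}, {4, 6}]   # last step is partly mispredicted

with_prefetch = run_decode(actual_routes, draft_guess)
without_prefetch = run_decode(actual_routes, [])
```

In this toy run, a correct prediction removes a step's transfers from the critical path, while a misprediction (expert 5 above) falls back to a synchronous load. The first step is never prefetched here, so its loads always count as synchronous.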
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · IoT and Edge/Fog Computing
