MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts

Wenfeng Wang; Jiacheng Liu; Xiaofeng Hou; Xinfeng Xia; Peng Tang; Mingxuan Zhang; Chao Li; Minyi Guo

arXiv:2511.14102 · cs.LG · November 19, 2025

TL;DR

MoE-SpeQ is a speculative execution and prefetching system for MoE models. It overlaps expert transfers from host memory with computation, which hides much of the I/O bottleneck and speeds up inference on memory-limited devices.

Contribution

The paper presents MoE-SpeQ, a system that co-designs speculative execution with expert offloading. A small on-device draft model predicts which experts upcoming tokens will need so that they can be prefetched, hiding I/O latency and improving inference speed on memory-constrained hardware.

Findings

01  Achieves up to 2.34x speedup over existing offloading methods.

02  Demonstrates effective hiding of I/O latency through speculative prefetching.

03  Provides a principled approach for data-dependent memory management in resource-limited environments.

Abstract

The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to…
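The prefetching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `PrefetchingCache`, `load_expert`, `decode`, and the `HOST_EXPERTS` table are hypothetical names, a thread pool stands in for asynchronous PCIe copies, each token is assumed to need a single expert, and cache eviction and quantization are left out. It only shows how predicted expert IDs let transfers start before the expert is needed, so that only mispredicted experts are fetched on demand.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

# Hypothetical stand-in for expert weights kept in host memory.
HOST_EXPERTS: Dict[int, str] = {i: f"weights-{i}" for i in range(8)}


def load_expert(expert_id: int) -> str:
    """Stand-in for a host-to-device copy of one expert."""
    return HOST_EXPERTS[expert_id]


class PrefetchingCache:
    """Unbounded toy cache; a real system would also evict experts."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending: Dict[int, Future] = {}
        # Counts demand loads for experts that were never prefetched.
        self.blocking_loads = 0

    def prefetch(self, expert_ids: List[int]) -> None:
        # Start background transfers for experts not yet requested.
        for eid in expert_ids:
            if eid not in self._pending:
                self._pending[eid] = self._pool.submit(load_expert, eid)

    def get(self, expert_id: int) -> str:
        fut = self._pending.get(expert_id)
        if fut is None:
            # Misprediction: the transfer starts only now, on the critical path.
            self.blocking_loads += 1
            fut = self._pool.submit(load_expert, expert_id)
            self._pending[expert_id] = fut
        return fut.result()

    def close(self) -> None:
        self._pool.shutdown()


def decode(actual: List[int], predicted: List[int],
           cache: PrefetchingCache) -> List[str]:
    """Use the expert each token actually needs, prefetching one step ahead."""
    cache.prefetch(predicted[:1])
    used = []
    for step, expert_id in enumerate(actual):
        if step + 1 < len(predicted):
            cache.prefetch([predicted[step + 1]])
        used.append(cache.get(expert_id))
    return used
```

In this sketch, running `decode([0, 3, 5, 3], [0, 3, 4, 3], cache)` leaves `cache.blocking_loads` at 1, because only the third token's expert was mispredicted and had to be fetched on demand.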

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · IoT and Edge/Fog Computing