Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding

Zhibin Wang; Zhonghui Zhang; Yuhang Zhou; Zibo Wang; Mo Zhou; Peng Jiang; Weilin Cai; Chengying Huan; Rong Gu; Sheng Zhong; Chen Tian

arXiv:2508.21706·cs.DC·November 3, 2025

TL;DR

This paper introduces SpecMoEOff, a novel approach that employs speculative decoding to improve hardware utilization and significantly accelerate Mixture-of-Experts inference by hiding offloading latency.

Contribution

It proposes SpecMoEOff, combining speculative decoding with offloading techniques, and develops a dedicated verification kernel and optimizer for enhanced MoE inference performance.

Findings

1. Achieves up to 2.5x increase in decode throughput

2. Effectively hides offloading latency using speculative decoding

3. Improves hardware utilization in MoE inference

Abstract

Recent advances in Mixture of Experts (MoE) models have significantly increased both their parameter scale and their performance. Many offloading techniques have been proposed to address the GPU memory limitations of MoE inference. However, because of the I/O bottleneck and the sparse computation of MoE models, existing offloading techniques still suffer from low hardware utilization. To fully utilize the hardware resources, we propose SpecMoEOff, which employs speculative decoding to enlarge the workload of each expert. SpecMoEOff orchestrates the GPU and CPU through both theoretical and empirical roofline analysis. In addition, we develop a dedicated CPU chunked attention verification kernel that adapts speculative decoding to offloading scenarios and minimizes the additional overhead introduced by draft models. SpecMoEOff further integrates an optimizer to automatically…

Tables

Table 1. Hardware configurations and cost of executing operator implementations.

	Hardware configuration	$P_{GPU}$ (TFLOPS)	$B_{GPU}$ (GB/s)	$B_{CPU-GPU}$ (GB/s)	$P_{CPU}$ (TFLOPS)	$B_{CPU}$ (GB/s)
Hardware	A30 + Intel Gold 6426Y	165	933	25	2.43	357
Hardware	4090D + Intel Gold 5418Y	83	1008	23	1.45	197
	Operator implementation	GPU operations	GPU memory access	Memory transfer	CPU operations	CPU memory access
Software	MoE (large batch) (cao2025moe)	$3 \times e \times n_{activate} \times b$	$3 \times n_{expert} \times e$	$3 \times n_{expert} \times e$	0	0
Software	MoE (batch=1) (yu2025fmoefinegrainedexpertoffloading)	$3 \times e \times n_{activate}$	$3 \times n_{activate} \times e$	$3 \times n_{activate} \times e \times r_{miss}$	0	0
Software	Attention (in CPU) (cao2025moe)	0	0	0	$2 \times b \times s \times h$	$2 \times b \times s \times h / g$
Software	Attention (to GPU) (sheng2023flexgen)	$2 \times b \times s \times h$	$2 \times b \times s \times h / g$	$2 \times b \times s \times h / g$	0	0
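
The quantities in Table 1 lend themselves to a roofline-style estimate: divide each kind of work by the matching peak rate and take the slowest resource as the bound. The following is a minimal sketch under stated assumptions, not the paper's cost model. The names `Hardware` and `roofline_time`, the perfect-overlap assumption (taking the maximum rather than a sum), and the example workload are all hypothetical; only the peak rates are copied from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Peak rates as listed in Table 1 (TFLOPS for compute, GB/s for bandwidth)."""
    p_gpu: float      # GPU compute, TFLOPS
    b_gpu: float      # GPU memory bandwidth, GB/s
    b_cpu_gpu: float  # CPU-to-GPU transfer bandwidth, GB/s
    p_cpu: float      # CPU compute, TFLOPS
    b_cpu: float      # CPU memory bandwidth, GB/s


A30_GOLD_6426Y = Hardware(165, 933, 25, 2.43, 357)
GPU_4090D_GOLD_5418Y = Hardware(83, 1008, 23, 1.45, 197)


def roofline_time(hw, gpu_flops=0.0, gpu_bytes=0.0, transfer_bytes=0.0,
                  cpu_flops=0.0, cpu_bytes=0.0):
    """Return (seconds, bottleneck), assuming all five resources overlap perfectly."""
    times = {
        "gpu_compute": gpu_flops / (hw.p_gpu * 1e12),
        "gpu_memory": gpu_bytes / (hw.b_gpu * 1e9),
        "transfer": transfer_bytes / (hw.b_cpu_gpu * 1e9),
        "cpu_compute": cpu_flops / (hw.p_cpu * 1e12),
        "cpu_memory": cpu_bytes / (hw.b_cpu * 1e9),
    }
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


# Hypothetical workload (not from the paper): move 1 GB of expert weights
# to the GPU, read them once from GPU memory, and run 1 GFLOP on them.
seconds, bottleneck = roofline_time(
    A30_GOLD_6426Y, gpu_flops=1e9, gpu_bytes=1e9, transfer_bytes=1e9
)
print(f"{seconds * 1e3:.1f} ms, bound by {bottleneck}")
```

With these made-up numbers the CPU-to-GPU link dominates by a wide margin, which matches the I/O bottleneck described in the abstract.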

Table 2. Datasets, where $\mu_{s}$ and $\sigma_{s}$ denote the average and standard deviation of the sequence lengths, respectively.

Dataset	Task	$\mu_{s}$	$\sigma_{s}$
APPS (hendrycksapps2021)	Coding	566.17	271.80
CNN/DailyMail (cnndailymail)	Summarization/QA	1005.56	491.66

Table 3. Performance Comparison: Actual vs. Estimated

	Actual	Estimated	Error
Target model	4.39s	4.03s	8.2%
CPU Attention	4.29s	3.88s	10.6%
GPU MoE	3.53s	3.17s	10.2%
HtoD Transfer	3.70s	3.55s	4.1%
Draft model	0.56s	0.41s	26.8%
GPU Part	0.42s	0.35s	16.7%
CPU Part	0.54s	0.41s	24.1%
Others	0.097s	0.12s	23.7%
Iteration	5.05s	4.56s	9.7%
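
The Error column in Table 3 can be checked with a few lines of arithmetic. The sketch below assumes the error is the absolute difference divided by the actual time; the table does not state its definition, so this is a guess, and the helper name `relative_error_pct` is hypothetical. Under this assumption most rows match the listed values after rounding, while the CPU Attention row comes out closer to 9.6% than the listed 10.6%, so that entry may be computed or rounded differently.

```python
# (actual seconds, estimated seconds), copied from Table 3.
ROWS = {
    "Target model": (4.39, 4.03),
    "CPU Attention": (4.29, 3.88),
    "GPU MoE": (3.53, 3.17),
    "HtoD Transfer": (3.70, 3.55),
    "Draft model": (0.56, 0.41),
    "GPU Part": (0.42, 0.35),
    "CPU Part": (0.54, 0.41),
    "Others": (0.097, 0.12),
    "Iteration": (5.05, 4.56),
}


def relative_error_pct(actual, estimated):
    """Assumed definition: |actual - estimated| / actual, as a percentage."""
    return abs(actual - estimated) / actual * 100.0


for name, (actual, estimated) in ROWS.items():
    print(f"{name:14s} {relative_error_pct(actual, estimated):5.1f}%")
```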

Equations

$$\mathrm{FFN}(x) = W_{down}\left(\sigma(W_{gate}\, x) \odot (W_{up}\, x)\right)$$
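
To make the gated feed-forward expression concrete, here is a minimal pure-Python sketch of a single expert's forward pass. It is illustrative only: the helper names `matvec` and `gated_ffn` are hypothetical, and the choice of SiLU for $\sigma$ is an assumption, since the equation does not say which activation is used.

```python
import math


def silu(v):
    """SiLU activation; assumed here as one common choice for sigma."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def gated_ffn(x, w_gate, w_up, w_down, act=silu):
    """Compute W_down (act(W_gate x) * (W_up x)) with an elementwise product."""
    gate = [act(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Tiny made-up weights: 2-dim input, 2-dim hidden layer, 1-dim output.
identity = [[1.0, 0.0], [0.0, 1.0]]
y = gated_ffn([1.0, 2.0], identity, identity, [[1.0, 1.0]])
print(y)
```

Each call touches three weight matrices per expert, so loading an expert means moving all three of them.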

$$\text{Given } H, M, R, \text{ find } P^{*} = \arg\max_{P = (b, m, k, S)} T(H, M, R, P).$$
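
The optimization problem above picks the configuration $P = (b, m, k, S)$ that maximizes $T$. One simple way to approach such a problem is exhaustive search over a small candidate grid, sketched below. This is an assumption-laden illustration rather than the paper's optimizer, which may search differently. The function `find_best_config`, the candidate lists, and the toy objective are hypothetical, and $H$, $M$, $R$ are taken to be fixed inside the supplied throughput function.

```python
from itertools import product


def find_best_config(throughput, b_values, m_values, k_values, s_values):
    """Return (P*, T(P*)) by evaluating every candidate P = (b, m, k, S)."""
    best_p, best_t = None, float("-inf")
    for p in product(b_values, m_values, k_values, s_values):
        t = throughput(p)
        if t > best_t:
            best_p, best_t = p, t
    return best_p, best_t


def toy_throughput(p):
    """Made-up objective with a single peak at (4, 2, 3, 1); not a real model."""
    b, m, k, s = p
    return -((b - 4) ** 2) - (m - 2) ** 2 - (k - 3) ** 2 + s


best_p, best_t = find_best_config(
    toy_throughput,
    b_values=[1, 2, 4, 8],
    m_values=[1, 2, 4],
    k_values=[1, 2, 3, 4],
    s_values=[0, 1],
)
print(best_p, best_t)
```

Exhaustive search is only practical when the grid is small and each evaluation of the throughput estimate is cheap; a larger space would need pruning or a smarter search.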

Taxonomy

Topics: Target Tracking and Data Fusion in Sensor Networks · Anomaly Detection Techniques and Applications · Distributed Sensor Networks and Detection Algorithms

Full text

Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding

Zhibin Wang