# Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding

**Authors:** Zhibin Wang, Zhonghui Zhang, Yuhang Zhou, Zibo Wang, Mo Zhou, Peng Jiang, Weilin Cai, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian

arXiv: 2508.21706 · 2025-11-03

## TL;DR

This paper introduces SpecMoEOff, which uses speculative decoding to enlarge each expert's workload during offloaded Mixture-of-Experts inference. This hides offloading latency, raises hardware utilization, and improves decode throughput by up to 2.5x over state-of-the-art MoE offloading techniques.

## Contribution

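The paper proposes SpecMoEOff, which combines speculative decoding with MoE offloading and orchestrates GPU and CPU work using theoretical and empirical roofline analysis. It also develops a dedicated CPU chunked attention verification kernel and an optimizer that automatically tunes the speculative decoding hyperparameters for the given hardware and workload.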

## Key findings

- Achieves up to 2.5x higher decode throughput than state-of-the-art MoE offloading techniques
- Effectively hides offloading latency using speculative decoding
- Improves hardware utilization in MoE inference

## Abstract

Recent advancements in Mixture of Experts (MoE) models have significantly increased their parameter scale as well as model performance. Numerous offloading techniques have been proposed to address the GPU memory limitations of MoE inference. However, due to the I/O bottleneck and sparse computation of MoE models, existing offloading techniques still suffer from low hardware utilization. To fully utilize the hardware resources, we propose SpecMoEOff, which employs speculative decoding to enlarge the workload of each expert. SpecMoEOff orchestrates the GPU and CPU using both theoretical and empirical roofline analysis. In addition, we develop a dedicated CPU chunked attention verification kernel that adapts speculative decoding to offloading scenarios and minimizes the additional overhead introduced by draft models. SpecMoEOff further integrates an optimizer that automatically tunes the hyperparameters of speculative decoding for the given hardware and workload. Experimental results show that SpecMoEOff achieves up to 2.5x decode throughput improvement over state-of-the-art MoE offloading techniques.
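The intuition behind enlarging each expert's workload can be illustrated with a simple roofline-style estimate. The sketch below is not the paper's model or optimizer: the `Hardware` class, the function names, and every number in it are hypothetical, and it assumes that weight transfer and compute overlap perfectly and that verifying `draft_len` draft tokens sends `draft_len + 1` positions per request through an expert.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    """Hypothetical hardware figures; none are taken from the paper."""
    peak_flops: float      # accelerator peak throughput, FLOP/s
    link_bandwidth: float  # CPU-to-GPU transfer bandwidth, bytes/s


def compute_time(tokens: int, expert_params: int, hw: Hardware) -> float:
    # A dense layer costs roughly 2 FLOPs (multiply and add) per parameter per token.
    return 2 * expert_params * tokens / hw.peak_flops


def transfer_time(expert_params: int, bytes_per_param: int, hw: Hardware) -> float:
    # Time to move one expert's weights over the CPU-to-GPU link.
    return expert_params * bytes_per_param / hw.link_bandwidth


def expert_step_time(tokens: int, expert_params: int, bytes_per_param: int, hw: Hardware) -> float:
    # Roofline assumption: transfer and compute overlap, so the slower one bounds the step.
    return max(transfer_time(expert_params, bytes_per_param, hw),
               compute_time(tokens, expert_params, hw))


def compute_utilization(tokens: int, expert_params: int, bytes_per_param: int, hw: Hardware) -> float:
    # Fraction of the step during which the accelerator does useful work.
    step = expert_step_time(tokens, expert_params, bytes_per_param, hw)
    return compute_time(tokens, expert_params, hw) / step


# Illustrative setup (all values are assumptions).
hw = Hardware(peak_flops=100e12, link_bandwidth=25e9)
expert_params = 100_000_000   # parameters in one expert
bytes_per_param = 2           # 16-bit weights
requests_per_expert = 8       # requests routed to this expert in one step
draft_len = 4                 # draft tokens verified per request

plain_tokens = requests_per_expert                    # one token per request
spec_tokens = requests_per_expert * (draft_len + 1)   # draft tokens plus one extra position

for label, tokens in [("plain decoding", plain_tokens), ("speculative", spec_tokens)]:
    step = expert_step_time(tokens, expert_params, bytes_per_param, hw)
    util = compute_utilization(tokens, expert_params, bytes_per_param, hw)
    print(f"{label}: tokens={tokens} step_time={step:.6f}s utilization={util:.4f}")
```

In this toy setting both configurations are transfer-bound, so the step time is unchanged while five times as many token positions are processed per weight transfer. The actual gain depends on how many draft tokens are accepted and on the cost of the draft model, which is what the paper's optimizer is described as tuning.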

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21706/full.md

## Figures

31 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21706/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/2508.21706/full.md

---
Source: https://tomesphere.com/paper/2508.21706