SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

Jehyeon Bang; Eunyeong Cho; Ranggi Hwang; Jinha Chung; Minsoo Rhu

arXiv:2604.10152·cs.AI·April 14, 2026

TL;DR

SpecMoE is a memory-efficient MoE inference system built on self-assisted speculative decoding. It improves inference throughput by up to 4.30× and reduces memory and interconnect bandwidth requirements, with no additional training.

Contribution

The paper presents a self-assisted speculative decoding algorithm for MoE inference that raises throughput and lowers memory and interconnect bandwidth requirements without retraining or fine-tuning the model.

Findings

01

Inference throughput improves by up to 4.30×

02

Memory and interconnect bandwidth requirements are significantly reduced

03

No additional model training or fine-tuning needed

Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect…
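For readers unfamiliar with the speculative decoding that the abstract builds on, the following is a minimal sketch of the generic greedy draft-then-verify loop. It is not the SpecMoE self-assisted algorithm, whose details are not given in this excerpt. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the two toy "models" are plain functions standing in for a full model and a cheaper drafter.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], num_new_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    num_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Generic greedy speculative decoding (illustrative sketch only)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1) Draft: the cheap model proposes draft_len tokens autoregressively.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2) Verify: a real system would score all drafted positions in one
        #    batched target pass; this sketch checks them one at a time.
        accepted: List[Token] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != proposed:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the target adds one more token.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the full model."""
    return (3 * ctx[-1] + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with toy_target except after token 6."""
    return 0 if ctx[-1] == 6 else toy_target(ctx)
```

Because every kept token is the target's own greedy choice, the output matches plain greedy decoding with the target model. Any speedup comes only from verifying several drafted positions per target pass, which this sequential sketch does not model.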