Speculating Experts Accelerates Inference for Mixture-of-Experts

Vivan Madan; Prajwal Singhania; Abhinav Bhatele; Tom Goldstein; Ashwinee Panda

arXiv:2603.19289 · cs.LG · March 23, 2026

TL;DR

This paper introduces a speculative expert prefetching method for Mixture-of-Experts models that overlaps expert loading with computation, reducing time per output token by up to 14% while generally maintaining downstream task accuracy.

Contribution

The paper proposes an expert prefetching scheme that predicts upcoming experts from the model's currently computed internal representations, so that CPU-to-GPU expert transfers overlap with computation during inference.
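
As a rough illustration of why overlapping transfers with computation helps, the following Python sketch models the latency of a single offloaded MoE layer during decoding. The function names, the millisecond values, and the cost model itself are assumptions made for illustration; they are not taken from the paper.

```python
def layer_latency_ms(compute_ms, transfer_ms, prefetch, hit=True,
                     execute_speculated=False):
    """Toy latency of one offloaded MoE layer in a decode step (hypothetical model)."""
    if not prefetch:
        # On-demand loading: experts are copied to the GPU, then executed.
        return transfer_ms + compute_ms
    if hit or execute_speculated:
        # The copy was issued early and ran alongside earlier computation,
        # so only the slower of the two is exposed.
        return max(compute_ms, transfer_ms)
    # Miss with re-fetch: the true experts are loaded on demand after all.
    return transfer_ms + compute_ms


def expected_latency_ms(compute_ms, transfer_ms, hit_rate,
                        execute_speculated=False):
    """Average latency over hits and misses for a given prediction hit rate."""
    hit_cost = layer_latency_ms(compute_ms, transfer_ms, True, hit=True)
    miss_cost = layer_latency_ms(compute_ms, transfer_ms, True, hit=False,
                                 execute_speculated=execute_speculated)
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


# Hypothetical numbers: 1 ms of compute, 2 ms to copy the experts.
print(expected_latency_ms(1.0, 2.0, hit_rate=0.5))
print(expected_latency_ms(1.0, 2.0, hit_rate=0.5, execute_speculated=True))
```

In this simplified model, a higher hit rate moves the average cost from the serial sum toward the overlapped maximum, and executing the speculated experts removes the re-fetch penalty on a miss entirely.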

Findings

1. Reduces time per output token by up to 14%.

2. Executing the speculated experts generally maintains downstream task accuracy.

3. Lightweight estimators improve expert prediction hit rates.

Abstract

Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts.…
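
The idea of speculating future experts from current representations can be sketched in a few lines of Python. This is a minimal sketch under a stated assumption: the predictor here simply applies the next layer's router weights to the current layer's hidden state. The abstract does not say that this is the paper's predictor, and every name below (`speculate_next_experts`, `hit_rate`, the toy dimensions) is hypothetical.

```python
import random


def router_logits(router_weights, hidden):
    """One logit per expert: dot product of each router row with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]


def top_k(scores, k):
    """Indices of the k highest-scoring experts, as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def speculate_next_experts(hidden_now, next_router, k):
    """Guess the next layer's experts from the hidden state available now (assumed predictor)."""
    return top_k(router_logits(next_router, hidden_now), k)


def hit_rate(predicted, actual):
    """Fraction of the truly routed experts that were speculated."""
    return len(predicted & actual) / len(actual)


# Toy setup: 8 experts, hidden size 16, top-2 routing (all values hypothetical).
rng = random.Random(0)
next_router = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
hidden_now = [rng.gauss(0, 1) for _ in range(16)]
# Assume the next layer's input differs from the current state by a small update.
hidden_next = [h + 0.1 * rng.gauss(0, 1) for h in hidden_now]

predicted = speculate_next_experts(hidden_now, next_router, k=2)
actual = top_k(router_logits(next_router, hidden_next), 2)
print(predicted, actual, hit_rate(predicted, actual))
```

In a real system, the predicted set would trigger asynchronous CPU-to-GPU copies of those experts' weights while the current layer is still computing; the hit rate then determines how often the copy has already finished by the time the router makes its actual selection.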

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications