Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models

Shien Zhu; Samuel Bohl; Robin Oester; Gustavo Alonso

arXiv:2511.10676 · cs.CL · November 17, 2025

TL;DR

This paper introduces a lightweight pre-attention expert prediction method for MoE large language models, significantly improving expert prediction accuracy and enabling efficient prefetching, including in the first layer.

Contribution

It proposes a novel ranking-preserving approach using simple linear functions and pre-attention activations for accurate expert prediction in MoE LLMs.

Findings

01  Achieves over 93% prediction accuracy on multiple models

02  Improves prediction accuracy by about 15% over state-of-the-art methods

03  Supports expert prefetching in the first layer

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up the model while keeping inference cost relatively low. Because MoE models activate only a subset of the experts, related work has proposed expert prediction and caching methods that prefetch experts for faster inference. However, existing approaches predict from the activations of the previous layer, which yields low accuracy and leaves the first layer unoptimized. Applying complex layers, or even training standalone networks, to improve prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, indicating that matching the ranking of selected experts using simple linear functions is possible. Therefore, we utilize the activations before the…
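The two ideas named in the abstract, ranking preservation and linear expert prediction, can be illustrated with a minimal sketch. This is not the authors' implementation: the predictor matrix `w_pred`, the dimensions, the random inputs, and all function names are hypothetical, and softmax is used only as a familiar example of a ranking-preserving function (it is strictly increasing, so the top-k set of the logits and of the softmax outputs coincide).

```python
import numpy as np

def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts, sorted ascending."""
    return np.sort(np.argsort(scores)[-k:])

def softmax(z):
    """Numerically stable softmax; strictly increasing, so it preserves ranking."""
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_experts(pre_attn_act, w_pred, k):
    """Hypothetical linear predictor: one matrix-vector product, then top-k."""
    return top_k_experts(w_pred @ pre_attn_act, k)

def prediction_accuracy(predicted, actual):
    """Fraction of the actually routed experts that were also predicted."""
    return len(np.intersect1d(predicted, actual)) / len(actual)

# Illustrative sizes and random data (assumptions, not values from the paper).
rng = np.random.default_rng(0)
hidden_dim, num_experts, k = 16, 8, 2
x = rng.standard_normal(hidden_dim)                      # stand-in for a pre-attention activation
w_pred = rng.standard_normal((num_experts, hidden_dim))  # stand-in for a trained linear predictor

logits = w_pred @ x
# Ranking preservation: top-k of the logits equals top-k of softmax(logits).
same_ranking = np.array_equal(top_k_experts(logits, k),
                              top_k_experts(softmax(logits), k))

# Experts a prefetcher would load before the router runs.
predicted = predict_experts(x, w_pred, k)
```

In a real system, `predicted` would be compared against the experts the router actually selects, and `prediction_accuracy` would measure how many of the prefetched experts end up being used. How the predictor is trained is outside the scope of this sketch.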

Taxonomy

Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Big Data and Digital Economy