MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices
Nishant Gavhane, Arush Mehrotra, Rohit Chawla, Peter Proenca

TL;DR
MoE-Beyond introduces a learning-based expert activation predictor for edge devices that raises GPU cache hit rate from 17% to 72%, helping large-scale MoE models run within tight memory constraints.
Contribution
This work presents a novel transformer-based predictor trained on expert activation traces, outperforming heuristics in cache hit rate and generalizing across unseen prompts.
Findings
Achieves 97.5% accuracy in expert activation prediction.
Improves GPU cache hit rate from 17% to 72%.
Outperforms heuristic caching strategies.
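To make the cache hit rate metric concrete, here is a minimal sketch of how a hit rate could be measured for a simple LRU expert cache over an activation trace. The function name `lru_hit_rate`, the toy trace, and the capacities are illustrative assumptions; this is not the paper's evaluation code, and LRU stands in for a generic heuristic baseline rather than MoE-Infinity's actual policy.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Fraction of expert requests served from an LRU cache.

    trace: list of decoding steps, each a list of activated expert ids
           (a hypothetical toy format, not the paper's trace format).
    capacity: number of experts that fit in the cache.
    """
    cache = OrderedDict()  # keys are expert ids, ordered oldest to newest
    hits = 0
    total = 0
    for step in trace:
        for expert in step:
            total += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)  # mark as most recently used
            else:
                cache[expert] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Toy trace: 4 decoding steps, 2 experts activated per step (assumed values).
toy_trace = [[0, 1], [1, 2], [0, 3], [1, 2]]
small_cache_rate = lru_hit_rate(toy_trace, capacity=2)
large_cache_rate = lru_hit_rate(toy_trace, capacity=4)
```

A learned predictor would aim to raise this ratio by loading experts before they are requested, instead of reacting to misses after the fact.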
Abstract
The deployment of large-scale Mixture-of-Experts (MoE) models on edge devices presents significant challenges due to memory constraints. While MoE architectures enable efficient utilization of computational resources by activating only a subset of experts per inference, they require careful memory management to operate efficiently in resource-constrained environments. Traditional heuristic-based expert caching strategies such as MoE-Infinity struggle to maintain high cache hit rates as model parameters scale. In this work, we introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. By framing the task as a multi-label sequence prediction problem, we train a lightweight transformer model on 66 million expert activation traces extracted from the LDJnr-Puffin dataset [5] using DeepSeek-V2-Chat-Lite MoE. Our…
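The multi-label framing mentioned in the abstract can be illustrated with a small sketch: at each decoding step the target is a multi-hot vector over experts, and a predictor's per-expert scores are turned into a prefetch set by taking the top-k. The helper names, the expert count of 4, and k = 2 are hypothetical toy values, not the configuration of DeepSeek-V2-Chat-Lite or the paper's predictor.

```python
def multi_hot(active_experts, num_experts):
    """Encode the set of activated experts as a 0/1 target vector."""
    target = [0] * num_experts
    for expert in active_experts:
        target[expert] = 1
    return target


def top_k_experts(scores, k):
    """Return the ids of the k highest-scoring experts, in ascending id order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: 4 experts, 2 active per token (assumed values).
target = multi_hot([1, 3], num_experts=4)
predicted = top_k_experts([0.1, 0.9, 0.2, 0.8], k=2)
```

Because several experts are active at once, each output position is scored independently (multi-label) rather than competing in a single softmax over experts (multi-class).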
