In-depth Analysis on Caching and Pre-fetching in Mixture of Experts Offloading

Shuning Lin; Yifan He; Yitong Chen

arXiv:2511.05814 · cs.LG · November 11, 2025

TL;DR

This paper provides an in-depth analysis of caching and pre-fetching techniques in MoE offloading, proposing optimizations and offering insights into MoE behavior to improve deployment efficiency on limited-memory devices.

Contribution

It introduces a detailed analysis of expert activation and caching behavior, proposes LFU caching, and demonstrates the potential of speculative expert pre-fetching for MoE models.

Findings

01 LFU caching outperforms LRU in MoE offloading

02 Speculative pre-fetching significantly reduces latency

03 Detailed traces reveal expert activation patterns
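Finding 02 concerns speculative pre-fetching. This page does not describe how the paper guesses upcoming experts, so the sketch below assumes one common heuristic: apply the next layer's router to the current layer's hidden state and start loading the top-k experts it selects. All names (`speculate_next_experts`, `next_gate`, `prefetch_accuracy`) and the toy sizes are hypothetical and are not taken from the paper.

```python
import random

def top_k(scores, k):
    """Return the indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def router_scores(hidden, gate):
    """Router logits: one dot product per expert (gate is num_experts x dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in gate]

def speculate_next_experts(hidden, next_gate, k):
    """Assumed heuristic: score the CURRENT hidden state with the NEXT
    layer's router and treat its top-k experts as the pre-fetch set."""
    return top_k(router_scores(hidden, next_gate), k)

def prefetch_accuracy(guessed, actual):
    """Fraction of the experts actually used that were pre-fetched."""
    return len(set(guessed) & set(actual)) / len(actual)

if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_experts, k = 16, 8, 2  # toy sizes, chosen only for illustration
    next_gate = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    hidden = [rng.gauss(0, 1) for _ in range(dim)]
    # Stand-in for the layer's residual update: a small perturbation.
    next_hidden = [h + 0.1 * rng.gauss(0, 1) for h in hidden]

    guessed = speculate_next_experts(hidden, next_gate, k)
    actual = top_k(router_scores(next_hidden, next_gate), k)
    print("guessed:", guessed, "actual:", actual,
          "accuracy:", prefetch_accuracy(guessed, actual))
```

A guess of this kind pays off only when the hidden state changes little between consecutive layers. A wrong guess costs a wasted transfer, and the correct expert must still be loaded on demand.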

Abstract

Mixture of Experts (MoE) is a crucial architecture used by many of today's most advanced models. A major challenge of MoE models is that they usually require much more memory than their dense counterparts because of their architecture, which makes them harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique for overcoming this challenge, especially when it is enhanced with caching and pre-fetching, but prior work stopped at a suboptimal caching algorithm and offered limited insights. In this work, we study MoE offloading in depth and make the following contributions: 1. We analyze expert activation and LRU caching behavior in detail and provide traces. 2. We propose an LFU caching optimization based on our analysis and obtain strong improvements over LRU. 3. We implement and…
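To make the LRU versus LFU contrast concrete, here is a minimal sketch of both eviction policies run over a synthetic expert-activation trace. The class names, the trace, the cache capacity, and the LFU details (access counts that persist after eviction, ties broken by lowest expert id) are assumptions for illustration. The abstract does not specify the paper's exact policy, and the hit counts below are not results from the paper.

```python
from collections import Counter, OrderedDict

class LRUExpertCache:
    """Evicts the expert that was used least recently."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> None, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recent
            return
        self.misses += 1  # a miss would trigger a host-to-GPU load
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # drop the oldest entry
        self.resident[expert_id] = None

class LFUExpertCache:
    """Evicts the resident expert with the fewest accesses so far."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()
        self.freq = Counter()  # assumption: counts survive eviction
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        self.freq[expert_id] += 1
        if expert_id in self.resident:
            self.hits += 1
            return
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # assumption: ties are broken by the lowest expert id
            victim = min(self.resident, key=lambda e: (self.freq[e], e))
            self.resident.remove(victim)
        self.resident.add(expert_id)

def run_trace(cache, trace):
    """Replay a sequence of expert ids and return the hit rate."""
    for expert_id in trace:
        cache.access(expert_id)
    return cache.hits / len(trace)

if __name__ == "__main__":
    # Synthetic trace: experts 0 and 1 are hot, 2-5 appear once each.
    trace = [0, 1, 0, 1, 0, 1, 2, 3, 0, 1, 4, 5, 0, 1]
    print("LRU hit rate:", run_trace(LRUExpertCache(2), trace))
    print("LFU hit rate:", run_trace(LFUExpertCache(2), trace))
```

On this trace, a burst of rarely used experts flushes both hot experts out of the LRU cache, while the frequency counts let LFU keep one of them resident. With a different trace or capacity the two policies can tie, so the gap shown here depends on the skewed access pattern that was assumed.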

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · IoT and Edge/Fog Computing · Stochastic Gradient Optimization Techniques