In-depth Analysis on Caching and Pre-fetching in Mixture of Experts Offloading
Shuning Lin, Yifan He, Yitong Chen

TL;DR
This paper provides an in-depth analysis of caching and pre-fetching techniques in MoE offloading, proposing optimizations and offering insights into MoE behavior to improve deployment efficiency on limited-memory devices.
Contribution
It introduces a detailed analysis of expert activation and caching behavior, proposes LFU caching, and demonstrates the potential of speculative expert pre-fetching for MoE models.
Findings
LFU caching outperforms LRU in MoE offloading
Speculative pre-fetching significantly reduces latency
Detailed traces reveal expert activation patterns
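To make the LRU versus LFU distinction concrete, here is a minimal sketch of the two eviction policies applied to an expert cache. It is not the authors' implementation: the class names, the `count_hits` helper, the tie-breaking rule, and the toy activation trace are all hypothetical, and this summary does not specify the exact LFU variant used in the paper (for example, whether counts are kept per layer or ever reset). The sketch assumes lifetime activation counts that survive eviction.

```python
from collections import Counter, OrderedDict


class LRUExpertCache:
    """Keeps the most recently used experts resident (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # expert_id -> None, least recent first

    def access(self, expert_id):
        """Return True on a cache hit, False on a miss (expert is then loaded)."""
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as most recently used
            return True
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict the least recently used
        self.slots[expert_id] = None
        return False


class LFUExpertCache:
    """Keeps the most frequently activated experts resident (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()
        self.counts = Counter()  # assumption: lifetime counts, kept after eviction

    def access(self, expert_id):
        """Return True on a cache hit, False on a miss (expert is then loaded)."""
        self.counts[expert_id] += 1
        if expert_id in self.resident:
            return True
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the fewest activations;
            # ties are broken by the lowest expert id (an arbitrary choice).
            victim = min(self.resident, key=lambda e: (self.counts[e], e))
            self.resident.remove(victim)
        self.resident.add(expert_id)
        return False


def count_hits(cache, trace):
    """Replay a sequence of expert ids through a cache and count the hits."""
    return sum(cache.access(expert_id) for expert_id in trace)


# Toy trace, not taken from the paper: expert 0 is "hot", the others are one-offs.
toy_trace = [0, 0, 0, 1, 2, 0, 3, 4, 0]
lru_hits = count_hits(LRUExpertCache(capacity=2), toy_trace)
lfu_hits = count_hits(LFUExpertCache(capacity=2), toy_trace)
print(f"LRU hits: {lru_hits}, LFU hits: {lfu_hits}")
```

On this hand-built trace the rarely used experts push the hot expert out of the LRU cache, while the LFU cache keeps it resident. The trace is constructed to show that effect and says nothing about real hit rates; those depend on the model's actual activation patterns.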
Abstract
Mixture of Experts (MoE) is a crucial architecture used by many of today's most advanced models. A major challenge of MoE models is that their architecture usually requires much more memory than that of their dense counterparts, which makes them harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique for overcoming this challenge, especially when enhanced with caching and pre-fetching, but prior work stopped at a suboptimal caching algorithm and offered limited insights. In this work, we study MoE offloading in depth and make the following contributions: 1. We analyze expert activation and LRU caching behavior in detail and provide traces. 2. We propose an LFU caching optimization based on our analysis and obtain strong improvements over LRU. 3. We implement and…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · IoT and Edge/Fog Computing · Stochastic Gradient Optimization Techniques
