Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate
Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

TL;DR
Fate is an offloading system for Mixture-of-Experts models that uses cross-layer gate inputs for accurate expert prediction, enabling efficient edge inference with high speedups and minimal quality loss.
Contribution
Fate introduces a novel cross-layer gate-based expert prefetching and caching strategy to improve MoE inference efficiency on resource-constrained edge devices.
Findings
Achieves up to 4.5x prefill speedup and 4.1x decoding speedup.
Expert hit rate reaches 99% with the caching strategy.
Performance scales across various memory budgets.
Abstract
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional…
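The cross-layer idea in the abstract can be illustrated with a toy sketch: feed the gate input of the current layer into the gate of the next layer, and treat the resulting top-k experts as the set to prefetch. Everything below is an assumption made for illustration, not the paper's implementation: the function names, the random gate weights, the dimensions, and the premise that the next layer's gate input is a small perturbation of the current one.

```python
import numpy as np


def top_k_experts(gate_weight: np.ndarray, hidden: np.ndarray, k: int) -> set:
    """Return the indices of the k experts with the highest gate logits."""
    logits = gate_weight @ hidden  # shape: (num_experts,)
    return set(np.argsort(logits)[-k:].tolist())


def predict_next_layer_experts(next_gate_weight: np.ndarray,
                               current_gate_input: np.ndarray,
                               k: int) -> set:
    """Hypothetical predictor: reuse the current layer's gate input as a
    stand-in for the next layer's gate input, so experts can be prefetched
    before the next layer actually runs."""
    return top_k_experts(next_gate_weight, current_gate_input, k)


rng = np.random.default_rng(0)
hidden_dim, num_experts, k = 16, 8, 2

# Toy gate weights for the next layer (random, not from a trained model).
gate_next = rng.standard_normal((num_experts, hidden_dim))

# Assumption: adjacent layers see similar gate inputs, modeled here as a
# small random perturbation of the current input.
h_current = rng.standard_normal(hidden_dim)
h_next = h_current + 0.01 * rng.standard_normal(hidden_dim)

predicted = predict_next_layer_experts(gate_next, h_current, k)  # prefetch set
actual = top_k_experts(gate_next, h_next, k)                     # true routing
hit_rate = len(predicted & actual) / k

print(f"predicted={sorted(predicted)} actual={sorted(actual)} hit_rate={hit_rate:.2f}")
```

In an offloading system, the predicted set would be copied from host memory to the accelerator while the current layer is still computing; any expert that was mispredicted has to be loaded on demand, which is the delay the abstract attributes to inaccurate predictions.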
Taxonomy
Topics: Data Quality and Management · Seismology and Earthquake Studies · Advanced Graph Neural Networks
Methods: Softmax · Attention Is All You Need · Mixture of Experts · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
