Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate

Zhiyuan Fang; Zicong Hong; Yuegui Huang; Yufeng Lyu; Wuhui Chen; Yue Yu; Fan Yu; Zibin Zheng

arXiv:2502.12224·cs.AI·May 8, 2025

Open Access 1 Repo

TL;DR

Fate is an offloading system for Mixture-of-Experts models that uses cross-layer gate inputs for accurate expert prediction, enabling efficient edge inference with high speedups and minimal quality loss.

Contribution

Fate introduces a novel cross-layer gate-based expert prefetching and caching strategy to improve MoE inference efficiency on resource-constrained edge devices.

Findings

01

Achieves up to 4.5x prefill speedup and 4.1x decoding speedup.

02

Expert hit rate reaches 99% with the caching strategy.

03

Performance scales across various memory budgets.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional…
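The key insight stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated here, so the sketch only assumes that the gate input of layer i is fed to the gate of layer i+1 to guess which experts to prefetch. All names (`predict_next_layer_experts`, `next_gate_weights`, the sizes, and the small-perturbation model of the next layer's input) are hypothetical and chosen for illustration only.

```python
import random


def matvec(weights, x):
    """Multiply a (num_experts x hidden) gate weight matrix by a hidden vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def top_k(scores, k):
    """Return the indices of the k largest scores.

    Softmax is monotonic, so ranking raw gate logits picks the same experts
    as ranking the softmax probabilities.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def predict_next_layer_experts(gate_input, next_gate_weights, k):
    """Hypothetical helper: apply layer i+1's gate to layer i's gate input."""
    return top_k(matvec(next_gate_weights, gate_input), k)


# Toy sizes, assumed for illustration only.
HIDDEN, NUM_EXPERTS, K = 16, 8, 2
rng = random.Random(0)

next_gate_weights = [
    [rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)
]
gate_input = [rng.gauss(0, 1) for _ in range(HIDDEN)]

# Assumption: the next layer's real gate input stays close to the current one,
# modeled here as the current input plus a small random perturbation.
next_gate_input = [v + rng.gauss(0, 0.05) for v in gate_input]

predicted = predict_next_layer_experts(gate_input, next_gate_weights, K)
actual = top_k(matvec(next_gate_weights, next_gate_input), K)

# Fraction of the experts actually selected that were predicted in advance.
hit_fraction = len(set(predicted) & set(actual)) / K
print("predicted:", predicted, "actual:", actual, "hit fraction:", hit_fraction)
```

In an offloading system, the predicted indices would be the experts to start loading before layer i+1 runs; any expert that is actually selected but was not predicted would have to be fetched on demand, which is the delay the abstract attributes to inaccurate prediction.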

Code & Models

Repositories

MindSpore-scientific-2/code-7/tree/main/Fate
mindspore

Taxonomy

Topics: Data Quality and Management · Seismology and Earthquake Studies · Advanced Graph Neural Networks

Methods: Softmax · Attention Is All You Need · Mixture of Experts · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings