AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

Shuzhang Zhong; Ling Liang; Yuan Wang; Runsheng Wang; Ru Huang; Meng Li

arXiv:2408.10284·cs.LG·August 21, 2024

TL;DR

AdapMoE introduces an adaptive gating and management framework for MoE models that dynamically adjusts expert activation and optimizes loading strategies, significantly improving inference efficiency on edge devices.

Contribution

It presents a novel sensitivity-based expert gating method combined with prefetching and cache management for efficient MoE inference.

Findings

01 Reduces activated experts by 25% on average

02 Achieves 1.35x speedup in inference

03 Maintains accuracy despite efficiency improvements

Abstract

Mixture-of-Experts (MoE) models are designed to enhance the efficiency of large language models (LLMs) without proportionally increasing the computational demands. However, their deployment on edge devices still faces significant challenges due to high on-demand loading overheads from managing sparsely activated experts. This paper introduces AdapMoE, an algorithm-system co-design framework for efficient MoE inference. AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads. We observe the heterogeneity of expert loading across layers and tokens, based on which we propose a sensitivity-based strategy to adjust the number of activated experts dynamically. Meanwhile, we also integrate advanced prefetching and cache management techniques to further reduce the loading latency. Through comprehensive evaluations on various platforms, we demonstrate…
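
To make the idea of adjusting the number of activated experts per token concrete, here is a minimal Python sketch. It is not the authors' method: this page does not say how sensitivity is computed, so the sketch assumes a simple stand-in rule in which a top-2 router drops the second expert whenever the top-1 expert already holds most of the routing weight. The function names and the fixed `threshold` parameter are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_top2(logits: List[float], threshold: float) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for one token.

    Assumption: `threshold` is a hypothetical per-layer cutoff on the
    renormalized top-1 weight. A sensitivity-based scheme would choose
    this value per layer; here it is simply passed in.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = order[0], order[1]

    # Renormalize over the two candidate experts, as top-2 routing does.
    pair_mass = probs[first] + probs[second]
    w1 = probs[first] / pair_mass

    if w1 >= threshold:
        # The second expert contributes little: skip it, so it never
        # has to be loaded on demand.
        return [(first, 1.0)]
    return [(first, w1), (second, 1.0 - w1)]


confident = adaptive_top2([5.0, 0.0, 0.0, 0.0], threshold=0.9)
uncertain = adaptive_top2([1.0, 0.9, 0.0, 0.0], threshold=0.9)
print(len(confident), len(uncertain))
```

In this sketch a token whose router is confident activates one expert and an ambiguous token activates two, so the average number of loaded experts falls as the threshold is lowered. How far it can be lowered without hurting accuracy is the question a sensitivity measure would have to answer.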

Code & Models

Repositories

pku-sec-lab/adapmoe (PyTorch, official)

Taxonomy

Methods: Mixture of Experts