DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models

Maryam Akhavan Aghdam; Hongpeng Jin; Yanzhao Wu

arXiv:2409.06669 · cs.LG · September 11, 2024 · 2 cites

TL;DR

This paper introduces DA-MoE, a dynamic routing mechanism for Mixture-of-Experts models that allocates experts based on token importance, leading to improved performance on NLP benchmarks.

Contribution

The paper presents a novel dynamic router that uses Transformer attention to measure token importance and, based on that measure, allocates a variable number of experts to each token, with the aim of improving MoE model efficiency and accuracy.

Findings

01. DA-MoE outperforms existing MoE models on the GLUE benchmark.

02. Dynamic expert allocation improves model efficiency.

03. Token importance effectively guides expert routing.

Abstract

Transformer-based Mixture-of-Experts (MoE) models have been driving several recent technological advancements in Natural Language Processing (NLP). These MoE models adopt a router mechanism to determine which experts to activate for routing input tokens. However, existing router mechanisms allocate a fixed number of experts to each token, which neglects the varying importance of different input tokens. In this study, we propose a novel dynamic router mechanism that Dynamically Allocates a variable number of experts for Mixture-of-Experts (DA-MoE) models based on an effective token importance measure. First, we show that the Transformer attention mechanism provides a natural and effective way of calculating token importance. Second, we propose a dynamic router mechanism that effectively decides the optimal number of experts (K) and allocates the top-K experts for each input token. Third,…
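The first two steps of the abstract (attention-derived token importance, then a per-token choice of K followed by top-K expert selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract shown here does not give the exact importance formula or the rule for choosing K, so the "attention received" measure, the mean-threshold rule, the `k_min`/`k_max` parameters, and all function names below are assumptions made purely for illustration.

```python
import numpy as np


def token_importance(attn):
    """Assumed importance measure: attention each token receives.

    attn has shape (heads, seq, seq); each row (query) sums to 1.
    Returns a (seq,) vector normalised to sum to 1.
    """
    # Average over heads, then sum over queries to get attention per key token.
    received = attn.mean(axis=0).sum(axis=0)
    return received / received.sum()


def experts_per_token(importance, k_min=1, k_max=2):
    """Hypothetical rule: above-average tokens get k_max experts, others k_min."""
    return np.where(importance > importance.mean(), k_max, k_min)


def dynamic_route(router_logits, k_per_token):
    """Pick the top-K experts per token, with K varying by token.

    router_logits has shape (seq, num_experts).
    Returns a list of (expert_ids, gate_weights) pairs, one per token.
    """
    routes = []
    for logits, k in zip(router_logits, k_per_token):
        top = np.argsort(logits)[::-1][:k]  # indices of the k largest logits
        gates = np.exp(logits[top] - logits[top].max())  # softmax over selected experts
        routes.append((top, gates / gates.sum()))
    return routes


# Toy example: every query attends only to token 0, so token 0 is "important".
attn = np.zeros((1, 3, 3))
attn[0, :, 0] = 1.0
router_logits = np.array([[0.0, 1.0, 2.0],
                          [3.0, 0.0, 0.0],
                          [0.0, 5.0, 1.0]])

importance = token_importance(attn)
k_per_token = experts_per_token(importance)
routes = dynamic_route(router_logits, k_per_token)
```

In this toy input, the token that receives all the attention is routed to two experts while the other tokens are routed to one, which is the qualitative behavior the abstract describes: more important tokens get more experts. A fixed top-K router corresponds to the special case `k_min == k_max`.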

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Data-Driven Disease Surveillance · Air Quality Monitoring and Forecasting

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Adam