DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models
Maryam Akhavan Aghdam, Hongpeng Jin, Yanzhao Wu

TL;DR
This paper introduces DA-MoE, a dynamic routing mechanism for Mixture-of-Experts models that allocates experts based on token importance, leading to improved performance on NLP benchmarks.
Contribution
The paper presents a novel dynamic router that uses Transformer attention to allocate a variable number of experts per token, enhancing MoE model efficiency and accuracy.
Findings
DA-MoE outperforms existing MoE models on the GLUE benchmark.
Dynamic expert allocation improves model efficiency.
Token importance effectively guides expert routing.
Abstract
Transformer-based Mixture-of-Experts (MoE) models have been driving several recent technological advancements in Natural Language Processing (NLP). These MoE models adopt a router mechanism to determine which experts to activate for routing input tokens. However, existing router mechanisms allocate a fixed number of experts to each token, which neglects the varying importance of different input tokens. In this study, we propose a novel dynamic router mechanism that Dynamically Allocates a variable number of experts for Mixture-of-Experts (DA-MoE) models based on an effective token importance measure. First, we show that the Transformer attention mechanism provides a natural and effective way of calculating token importance. Second, we propose a dynamic router mechanism that effectively decides the optimal number of experts (K) and allocates the top-K experts for each input token. Third,…
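The routing idea described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the exact importance measure or the rule for choosing K, so the sketch assumes that importance is the average attention a token receives (column mean of the attention matrix) and that K is obtained by min-max scaling importance into the range 1 to `max_k`. The function names `token_importance`, `experts_per_token`, and `route`, and the parameter `max_k`, are hypothetical.

```python
import math


def token_importance(attn):
    """Average attention each token receives (column mean of a seq x seq matrix).

    Assumption: this is one plausible attention-based importance score,
    not necessarily the measure used in the paper.
    """
    n = len(attn)
    return [sum(attn[q][t] for q in range(n)) / n for t in range(n)]


def experts_per_token(importance, max_k):
    """Map importance scores to an expert count K in [1, max_k].

    Assumption: min-max scaling with round-half-up; a hypothetical rule.
    """
    lo, hi = min(importance), max(importance)
    span = hi - lo
    if span == 0:
        # All tokens equally important: fall back to one expert each.
        return [1] * len(importance)
    return [1 + math.floor((s - lo) / span * (max_k - 1) + 0.5) for s in importance]


def route(router_logits, ks):
    """Pick each token's top-K experts and softmax-normalize their logits."""
    routes = []
    for logits, k in zip(router_logits, ks):
        top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
        m = max(logits[e] for e in top)  # subtract max for numerical stability
        exps = [math.exp(logits[e] - m) for e in top]
        z = sum(exps)
        routes.append([(e, w / z) for e, w in zip(top, exps)])
    return routes


# Toy example: 3 tokens, 4 experts. Each attention row sums to 1.
attn = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
router_logits = [
    [2.0, 1.0, 0.5, 0.1],
    [0.2, 1.5, 0.3, 0.9],
    [0.1, 0.2, 3.0, 0.4],
]
ks = experts_per_token(token_importance(attn), max_k=4 - 1)
for token, assignment in enumerate(route(router_logits, ks)):
    print(token, assignment)
```

In this toy input the first token receives the most attention and is therefore sent to the most experts, while the last token receives the least and is sent to a single expert. A fixed top-K router would correspond to passing the same K for every token.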
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Data-Driven Disease Surveillance · Air Quality Monitoring and Forecasting
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Adam
