DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks
Gökdeniz Gülmez

TL;DR
DynaMoE introduces a flexible MoE framework with dynamic token-level expert activation and adaptive capacity scheduling, improving efficiency and performance across tasks by tailoring expert use to input complexity and task scale.
Contribution
DynaMoE relaxes two assumptions of standard MoE models: a fixed number of active experts per token and uniform expert allocation across layers. It proposes a dynamic routing mechanism and multiple layer-wise scheduling strategies, supported by theoretical analysis and extensive empirical validation.
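To make the idea of a variable number of experts per token concrete, here is a minimal NumPy sketch. This summary does not state DynaMoE's actual routing rule, so the cumulative-probability threshold used here, the function name `dynamic_route`, and the parameters `threshold` and `max_experts` are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def dynamic_route(logits, threshold=0.5, max_experts=None):
    """Hypothetical dynamic routing: for each token, keep the smallest set of
    top-ranked experts whose cumulative router probability reaches `threshold`.

    logits: array of shape (num_tokens, num_experts)
    Returns (mask, weights), both of shape (num_tokens, num_experts).
    """
    # Numerically stable softmax over the expert dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Rank experts per token from most to least probable.
    order = np.argsort(-probs, axis=-1)
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)

    # Keep an expert if the mass accumulated *before* it is still below the
    # threshold; the top-ranked expert is therefore always kept.
    keep_sorted = (cum - sorted_p) < threshold
    if max_experts is not None:
        keep_sorted[:, max_experts:] = False

    # Scatter the decision back to the original expert order.
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)

    # Renormalize the router weights over the selected experts only.
    weights = np.where(mask, probs, 0.0)
    weights /= weights.sum(axis=-1, keepdims=True)
    return mask, weights

# A confident token and an uncertain token, routed over 4 experts.
logits = np.array([[10.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
mask, weights = dynamic_route(logits, threshold=0.6)
experts_per_token = mask.sum(axis=1)
```

Under this assumed rule, a token whose router distribution is peaked activates a single expert, while a token with a flat distribution activates several, which is one way "expert use tailored to input complexity" could be realized.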
Findings
Dynamic routing improves expressivity and efficiency.
Descending schedules excel in image classification tasks.
Optimal expert schedules vary with task and model scale.
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism where the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of…
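The four schedule patterns named in the abstract can be illustrated with a short sketch that assigns an expert count to each layer. The exact formulas are not given in this summary, so the shapes below (linear ramps, a triangular pyramid, a single cosine period for the wave) and the names `expert_schedule`, `min_experts`, and `max_experts` are assumptions chosen for illustration. The remaining two of the six strategies are not named here and are left out.

```python
import math

def expert_schedule(pattern, num_layers, min_experts, max_experts):
    """Hypothetical per-layer expert counts for a named schedule pattern."""
    span = max_experts - min_experts
    counts = []
    for i in range(num_layers):
        # Relative depth in [0, 1]: 0 is the first layer, 1 is the last.
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        if pattern == "descending":
            f = 1.0 - t                      # most experts in early layers
        elif pattern == "ascending":
            f = t                            # most experts in late layers
        elif pattern == "pyramid":
            f = 1.0 - abs(2.0 * t - 1.0)     # peak in the middle layers
        elif pattern == "wave":
            # One cosine period over depth (an assumed shape).
            f = 0.5 * (1.0 + math.cos(2.0 * math.pi * t))
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        counts.append(min_experts + round(span * f))
    return counts

# Example: 5 layers, between 2 and 10 experts per layer.
schedules = {
    name: expert_schedule(name, num_layers=5, min_experts=2, max_experts=10)
    for name in ("descending", "ascending", "pyramid", "wave")
}
```

With these assumed shapes, a descending schedule concentrates experts in the early layers and an ascending one in the late layers, while the total expert budget can be compared across patterns by summing each list.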
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Mobile Crowdsensing and Crowdsourcing
