AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng

TL;DR
AdaMoE introduces token-adaptive routing in Mixture-of-Experts models, allowing tokens to select varying numbers of experts with minimal modifications, leading to reduced computational load and improved accuracy.
Contribution
It proposes a novel token-adaptive routing mechanism for MoE models using null experts, enhancing efficiency and performance without complex changes.
Findings
Reduces FLOPs by 14.5% on the ARC-C dataset.
Achieves 1.69% higher accuracy after fine-tuning.
Easy to implement and compatible with pre-trained LLMs.
Abstract
Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because different tokens (e.g., "<EOS>" vs. "apple") may require different numbers of experts for feature abstraction. Lifting such a constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. In this sense, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select a varying number of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and…
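The null-expert idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `route_with_null_experts` and `moe_layer`, the toy experts, the value of k, and the choice to renormalize gate weights over the selected true experts are all assumptions made here for illustration. The point it shows is that a plain top-k over "true plus null" experts already gives each token a variable number of true experts, because every null expert a token selects is simply skipped.

```python
import math

def route_with_null_experts(logits, num_true, k):
    """Top-k routing over true + null experts for a single token.

    `logits` holds one router score per expert; indices >= num_true are
    treated as null experts (hypothetical layout chosen for this sketch).
    Returns the selected true expert ids and their gate weights.
    """
    # Ordinary top-k selection over the combined expert set.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    selected = order[:k]
    # Null experts do no computation, so they are simply dropped.
    true_ids = [i for i in selected if i < num_true]
    if not true_ids:
        return [], []
    # Assumption: softmax restricted to the selected true experts.
    m = max(logits[i] for i in true_ids)
    exps = [math.exp(logits[i] - m) for i in true_ids]
    total = sum(exps)
    return true_ids, [e / total for e in exps]

def moe_layer(x, logits, experts, k):
    """Weighted sum of the selected true experts' outputs for token x."""
    true_ids, weights = route_with_null_experts(logits, len(experts), k)
    out = [0.0] * len(x)
    for i, w in zip(true_ids, weights):
        y = experts[i](x)  # only true experts are ever evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out, len(true_ids)

if __name__ == "__main__":
    # Four toy "true" experts that scale the input; two null slots follow.
    experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
    token = [1.0, 2.0]
    easy = [2.0, 0.0, 0.1, 0.0, 3.0, 1.0]    # a null expert ranks first
    hard = [2.0, 1.0, 0.1, 0.0, -1.0, -2.0]  # both picks are true experts
    for name, logits in (("easy", easy), ("hard", hard)):
        out, used = moe_layer(token, logits, experts, k=2)
        print(name, "token used", used, "true expert(s):", out)
```

With the same k, the "easy" token spends one of its two slots on a null expert and therefore runs a single true expert, while the "hard" token runs two. Per-token compute varies even though the router itself is unchanged top-k.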
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Expert finding and Q&A systems
Methods: Sparse Evolutionary Training · Mixture of Experts
