AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

Zihao Zeng; Yibo Miao; Hongcheng Gao; Hao Zhang; Zhijie Deng

arXiv:2406.13233 · cs.AI · October 15, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

AdaMoE introduces token-adaptive routing for Mixture-of-Experts models: with minimal modifications to top-k routing, each token can select a different number of experts, which reduces computational load and improves accuracy.

Contribution

The paper proposes a token-adaptive routing mechanism for MoE models based on null experts, improving efficiency and performance without complex changes.

Findings

1. Reduces FLOPs by 14.5% on the ARC-C dataset.
2. Achieves 1.69% higher accuracy after fine-tuning.
3. Is easy to implement and compatible with pre-trained LLMs.

Abstract

Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) because it promises to boost model capacity without significant overhead. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because different tokens (e.g., "<EOS>" vs. "apple") may require different numbers of experts for feature abstraction. Lifting this constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. To this end, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select a varying number of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing: it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and…
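
The routing idea in the abstract can be illustrated with a small sketch. This is a pure-Python illustration under stated assumptions, not the authors' implementation (the official repository is in PyTorch). The function name `route_with_null_experts`, the example logits, the pool sizes, and the choice to renormalize weights over the selected true experts are all hypothetical; the truncated abstract does not specify these details. The sketch only shows how a fixed top-k over a pool of true and null experts yields a per-token variable number of true experts.

```python
import math


def route_with_null_experts(logits, num_true, k):
    """Select the top-k experts from a pool of true and null experts.

    logits   -- router scores; the first `num_true` entries belong to true
                experts, the remaining entries to null experts (assumed layout).
    num_true -- number of true experts in the pool.
    k        -- how many experts (true or null) each token selects.

    Returns a list of (expert_index, weight) pairs for the selected true
    experts only. Null experts do no computation, so they are dropped.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Numerically stable softmax over the whole pool (true + null).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k highest-probability experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Keep only true experts; a selected null expert costs no FLOPs.
    chosen = [i for i in top if i < num_true]

    # Assumption: renormalize the weights over the selected true experts.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical pool: 4 true experts followed by 2 null experts, top-3 routing.
content_token = [2.0, 1.5, 1.0, -1.0, 0.0, 0.0]
filler_token = [0.5, -1.0, -2.0, -3.0, 1.0, 1.0]

for name, scores in [("content", content_token), ("filler", filler_token)]:
    selected = route_with_null_experts(scores, num_true=4, k=3)
    print(name, "uses", len(selected), "true expert(s):", selected)
```

In this toy setting both tokens make three selections, but the token whose router scores favor the null slots ends up running fewer true experts, which is where the compute saving would come from.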

Code & Models

Repositories

cengzihao/adamoe (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Recommender Systems and Techniques · Expert finding and Q&A systems

Methods: Sparse Evolutionary Training · Mixture of Experts