$\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts
Shota Takashiro, Takeshi Kojima, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo

TL;DR
This paper introduces $\infty$-MoE, a novel approach that extends the Mixture of Experts model to an infinite number of experts by selecting parameter subsets based on continuous values, improving training stability and efficiency.
Contribution
It proposes a continuous expert selection mechanism enabling infinite experts, enhancing training stability and flexibility in model size and inference trade-offs.
Findings
Achieves comparable performance to larger dense models with fewer parameters.
Allows flexible adjustment of experts at inference for accuracy-speed trade-off.
Improves accuracy by up to 2.5% over traditional MoE.
Abstract
The Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space. As a result, when the number of experts increases, it becomes difficult to train each expert effectively. To stabilize training while increasing the number of experts, we propose $\infty$-MoE, which selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By considering experts in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M…
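The idea of selecting part of a large FFN from a per-token continuous value can be illustrated with a small NumPy sketch. This is not the paper's implementation: the abstract does not specify the routing rule, so the sigmoid router, the placement of hidden units on $[0, 1]$, the nearest-$k$ selection, and all names (`continuous_expert_ffn`, `w_route`, `k`) are assumptions made purely for illustration. The router here is also deterministic, whereas the abstract describes the continuous values as sampled.

```python
import numpy as np

def continuous_expert_ffn(x, w_in, w_out, w_route, k):
    """Apply a large FFN using only the k hidden units nearest a per-token value.

    x:       (tokens, d_model) token activations
    w_in:    (d_model, d_hidden) first FFN projection
    w_out:   (d_hidden, d_model) second FFN projection
    w_route: (d_model,) hypothetical router weights, one scalar score per token
    k:       hidden units kept per token (can be changed at inference time)
    """
    d_hidden = w_in.shape[1]
    # Hypothetical router: squash a linear score into (0, 1).
    z = 1.0 / (1.0 + np.exp(-(x @ w_route)))         # (tokens,)
    # Assumption: hidden units sit at evenly spaced positions on [0, 1],
    # so every value of z selects its own neighborhood of units.
    positions = np.linspace(0.0, 1.0, d_hidden)      # (d_hidden,)
    dist = np.abs(positions[None, :] - z[:, None])   # (tokens, d_hidden)
    # Keep the k nearest hidden units for each token.
    keep = np.argsort(dist, axis=1)[:, :k]           # (tokens, k)
    mask = np.zeros_like(dist)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    # For clarity this computes the dense activation and masks it; an
    # efficient version would gather only the selected columns and rows.
    h = np.maximum(x @ w_in, 0.0) * mask             # ReLU, then mask
    return h @ w_out, mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_in = rng.normal(size=(8, 32))
w_out = rng.normal(size=(32, 8))
w_route = rng.normal(size=8)

# Each token uses 8 of the 32 hidden units.
y, mask = continuous_expert_ffn(x, w_in, w_out, w_route, k=8)
```

Because `z` varies continuously, the set of possible neighborhoods is not tied to a fixed list of discrete experts, and raising or lowering `k` at inference trades compute for capacity. Setting `k` to the full hidden width recovers the dense FFN.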
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
