Towards an empirical understanding of MoE design choices
Dongyang Fan, Bettina Messmer, Martin Jaggi

TL;DR
This paper systematically evaluates how common design choices in Mixture of Experts (MoE) models affect validation performance. It finds that these choices act differently at the token and sequence levels, and that routing granularity shapes the kind of expert specialization that emerges.
Contribution
It provides empirical evidence on how routing strategies affect MoE validation performance, and compares a learned router against a frozen, randomly initialized one.
Findings
A frozen, randomly initialized router performs comparably to a learned one, so learned routing may not be essential.
Sequence-level routing can lead to weak, topic-specific expert specialization.
Token-level routing tends to produce syntax-specific expert specialization.
Abstract
In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels. We also present empirical evidence showing comparable performance between a learned router and a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that sequence-level routing can result in weak, topic-specific expert specialization, in contrast to the syntax specialization observed with token-level routing.
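To make the distinction between the two routing granularities concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the top-k selection rule, and the use of mean pooling to obtain one routing decision per sequence are all hypothetical choices made for this example. The router is a single random matrix that is never updated, which mirrors the "frozen, randomly initialized router" idea only in spirit.

```python
import numpy as np

def top_k_experts(logits, k):
    """Return the indices of the k largest logits along the last axis."""
    return np.argsort(logits, axis=-1)[..., ::-1][..., :k]

def token_level_route(x, router_w, k=1):
    """Each token picks its own experts from its own router logits."""
    logits = x @ router_w                      # (seq_len, n_experts)
    return top_k_experts(logits, k)            # (seq_len, k)

def sequence_level_route(x, router_w, k=1):
    """All tokens in a sequence share one expert choice.

    Assumption: the sequence is summarized by mean pooling before routing.
    """
    pooled = x.mean(axis=0)                    # (d_model,)
    logits = pooled @ router_w                 # (n_experts,)
    experts = top_k_experts(logits, k)         # (k,)
    return np.tile(experts, (x.shape[0], 1))   # (seq_len, k)

rng = np.random.default_rng(0)
d_model, n_experts, seq_len = 16, 4, 8

# Frozen random router: initialized once and never trained (hypothetical setup).
frozen_router = rng.standard_normal((d_model, n_experts))
x = rng.standard_normal((seq_len, d_model))    # one sequence of token embeddings

tok_assign = token_level_route(x, frozen_router, k=2)
seq_assign = sequence_level_route(x, frozen_router, k=2)
```

With token-level routing, `tok_assign` can differ from row to row, so an expert sees tokens from many sequences. With sequence-level routing, every row of `seq_assign` is identical, so an expert sees whole sequences at a time.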
Taxonomy
Topics: Usability and User Interface Design · Innovative Approaches in Technology and Social Development · Technology Use by Older Adults
