Towards an empirical understanding of MoE design choices

Dongyang Fan; Bettina Messmer; Martin Jaggi

arXiv:2402.13089 · cs.LG · February 21, 2024 · 1 citation

Open Access

TL;DR

This paper systematically evaluates how different design choices in Mixture of Experts models affect validation performance, revealing insights about routing strategies and expert specialization.

Contribution

The paper provides empirical evidence on how routing strategies affect MoE validation performance, and it compares a learned router against a frozen, randomly initialized one.

Findings

01

Learned routing may not be necessary for good performance.

02

Sequence-level routing can lead to weak, topic-specific expert specialization.

03

Token-level routing tends to produce syntax-specific expert specialization.

Abstract

In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels. We also present empirical evidence showing comparable performance between a learned router and a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that Sequence-level routing can result in topic-specific weak expert specialization, in contrast to syntax specialization observed with Token-level routing.
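
The distinction between token-level and sequence-level routing, and the idea of a frozen, randomly initialized router, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the top-k selection rule, and in particular the use of mean-pooling to summarize a sequence are assumptions made for illustration only. The sketch covers expert selection and leaves out the expert networks and any training loop.

```python
import random


def make_frozen_router(d_model, n_experts, seed=0):
    """Hypothetical helper: router weights drawn once at random and never updated."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_experts)]


def router_logits(router, x):
    """One logit per expert: dot product of each router row with the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in router]


def top_k(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def route_token_level(router, tokens, k=1):
    """Token-level routing: every token picks its own experts."""
    return [top_k(router_logits(router, t), k) for t in tokens]


def route_sequence_level(router, tokens, k=1):
    """Sequence-level routing: all tokens in a sequence share one expert choice.

    Assumption: the sequence is summarized by mean-pooling its token vectors.
    """
    n = len(tokens)
    pooled = [sum(column) / n for column in zip(*tokens)]
    choice = top_k(router_logits(router, pooled), k)
    return [list(choice) for _ in tokens]


if __name__ == "__main__":
    rng = random.Random(1)
    # Five toy "token embeddings" of dimension 8.
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
    router = make_frozen_router(d_model=8, n_experts=4)
    print("token-level:   ", route_token_level(router, tokens, k=2))
    print("sequence-level:", route_sequence_level(router, tokens, k=2))
```

In this sketch, token-level routing may send neighbouring tokens to different experts, while sequence-level routing assigns the same experts to every token in the sequence. A learned router would differ only in that its weights are updated during training rather than kept fixed.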

Taxonomy

Topics: Usability and User Interface Design · Innovative Approaches in Technology and Social Development · Technology Use by Older Adults