Routers in Vision Mixture of Experts: An Empirical Study
Tianlin Liu, Mathieu Blondel, Carlos Riquelme, Joan Puigcerver

TL;DR
This paper empirically compares six routers for vision Mixture-of-Experts models, covering both sparse MoEs (Token Choice and Expert Choice) and soft MoEs, and reports how these design choices affect performance.
Contribution
The paper introduces a unified MoE formulation built on two parametric routing tensors and evaluates six routers, including new variants, on vision tasks.
Findings
Expert Choice routers outperform Token Choice in sparse MoEs.
Soft MoEs outperform sparse MoEs under fixed compute budgets.
Language-model routers adapt well to vision tasks.
Abstract
Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different…
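The three assignment styles named in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation. The function names, the token-by-expert `logits` matrix, the parameters `k` and `capacity`, and the simplification of one slot per expert in the soft case are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_choice_mask(logits, k):
    """Hard assignment: each token (row) selects its k highest-scoring experts."""
    n, e = logits.shape
    top = np.argsort(-logits, axis=1)[:, :k]          # shape (n, k)
    mask = np.zeros((n, e), dtype=bool)
    mask[np.arange(n)[:, None], top] = True
    return mask

def expert_choice_mask(logits, capacity):
    """Hard assignment: each expert (column) selects its `capacity` highest-scoring tokens."""
    n, e = logits.shape
    top = np.argsort(-logits, axis=0)[:capacity, :]   # shape (capacity, e)
    mask = np.zeros((n, e), dtype=bool)
    mask[top, np.arange(e)[None, :]] = True
    return mask

def soft_weights(logits):
    """Soft assignment (assumed: one slot per expert).

    dispatch: each expert receives a weighted combination of all tokens,
              so the weights are normalised over tokens (axis 0).
    combine:  each token receives a weighted combination of expert outputs,
              so the weights are normalised over experts (axis 1).
    """
    dispatch = softmax(logits, axis=0)
    combine = softmax(logits, axis=1)
    return dispatch, combine

# Hypothetical router scores for 8 tokens and 4 experts.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))

tc = token_choice_mask(logits, k=2)
ec = expert_choice_mask(logits, capacity=4)
dispatch, combine = soft_weights(logits)
```

In this sketch, Token Choice fixes the number of experts per token (`tc.sum(axis=1)`) but leaves the load per expert (`tc.sum(axis=0)`) unconstrained, whereas Expert Choice fixes the load per expert and lets some tokens be selected by several experts or by none.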
Taxonomy
Topics: Optics and Image Analysis
