TL;DR
This paper investigates when sparse Mixture-of-Experts (MoE) models improve vision classification accuracy. It emphasizes compute leverage and multi-expert routing, with multi-seed experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K.
Contribution
It identifies the compute-leverage pattern and the necessity of multi-expert routing for effective sparse MoE deployment in vision tasks.
Findings
Positive accuracy gaps require that a substantial fraction of total FLOPs be routed.
Multi-expert routing ($k \,\geq\, 2$) is necessary at large scales like ImageNet.
Softmax-based per-sample dispatch can rescue performance in certain settings.
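The top-$k$ routing with a hard capacity limit that these findings refer to can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`softmax`, `route_top_k`), the greedy first-come capacity rule, and the gate renormalization are all assumptions made for illustration, and the toy logits are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k, capacity):
    """Send each token to its top-k experts, subject to a hard capacity.

    router_logits: list of per-token lists, one logit per expert.
    Returns (assignments, load), where assignments[t] is a list of
    (expert_index, gate_weight) pairs and load[e] counts tokens kept
    by expert e. Assumption: tokens are processed in order, and an
    assignment to a full expert is simply dropped.
    """
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        # Expert indices sorted by descending router probability.
        ranked = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)
        kept = []
        for e in ranked[:k]:
            if load[e] < capacity:  # hard capacity: overflow is dropped
                load[e] += 1
                kept.append((e, probs[e]))
        # Renormalize gates over the experts that accepted the token.
        norm = sum(g for _, g in kept)
        assignments.append([(e, g / norm) for e, g in kept])
    return assignments, load


# Toy example: 4 tokens, 3 experts, k = 2, capacity of 2 tokens per expert.
toy_logits = [
    [2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
    [3.0, 2.0, -1.0],
]
assignments, load = route_top_k(toy_logits, k=2, capacity=2)
print(load)            # [2, 2, 1]
print(assignments[3])  # [] (both preferred experts were already full)
```

In this toy run the last token is dropped entirely because its two preferred experts are full, which shows how a hard capacity interacts with a skewed router: when most tokens prefer the same experts, later tokens lose their routed compute.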
Abstract
Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only…
