Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization
Adam Rokah, Daniel Veress, Caleb Caulk, Sourav Sharan

TL;DR
This paper investigates the behavior of mixture-of-experts models in image classification, analyzing their performance, expert utilization, and generalization, and comparing them to dense models on CIFAR10.
Contribution
It provides a detailed analysis of MoE models in vision, covering generalization, loss-curvature properties, and inference efficiency, a setting that has been explored less than language modeling.
Findings
MoE models slightly outperform dense models in validation accuracy.
SoftMoE exhibits higher sharpness, as measured by Hessian-based metrics, than both the dense baseline and SparseMoE.
Naive routing does not improve inference speed on modern hardware.
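To make the soft versus sparse distinction in the findings concrete, here is a minimal sketch of the two gating styles and a simple utilization check. It assumes that "soft" means every expert receives a softmax gate weight and that "sparse" means top-k gating with renormalization; the paper's actual formulations may differ. All function names (`soft_gate`, `topk_gate`, `expert_utilization`, `mix`) are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_gate(logits):
    """Soft routing (assumed form): every expert gets a nonzero weight."""
    return softmax(logits)


def topk_gate(logits, k=1):
    """Sparse routing (assumed form): keep the k largest gate weights,
    renormalize them to sum to 1, and zero out the rest."""
    probs = softmax(logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def expert_utilization(gates_per_input):
    """Mean gate weight per expert over a batch.
    Values close to 1 / num_experts indicate balanced usage."""
    n = len(gates_per_input)
    num_experts = len(gates_per_input[0])
    return [sum(g[e] for g in gates_per_input) / n for e in range(num_experts)]


def mix(expert_outputs, gates):
    """Combine scalar expert outputs using the gate weights."""
    return sum(w * y for w, y in zip(gates, expert_outputs))
```

Note that `mix` still touches every expert output even when most gate weights are zero. A naive sparse implementation of this kind saves no computation, which is one plausible reading of the finding that naive routing does not improve inference speed.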
Abstract
Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these…
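The two sharpness metrics named in the abstract, the largest eigenvalue and the trace of the loss Hessian, are usually estimated from Hessian-vector products, since the full Hessian is rarely formed. The sketch below illustrates that idea on a toy quadratic loss, using power iteration for the largest eigenvalue and Hutchinson's estimator for the trace. This is an assumption about a typical approach, not the authors' code: the helper names, the finite-difference Hessian-vector product, and the toy matrix `A` are all hypothetical.

```python
import math
import random


def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate the Hessian-vector product H v with a central
    finite difference of the gradient around w."""
    plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def top_eigenvalue(grad_fn, w, iters=100):
    """Power iteration. It converges to the eigenvalue of largest
    magnitude, which is the largest eigenvalue when the Hessian is
    positive semidefinite."""
    v = [1.0 / math.sqrt(len(w))] * len(w)
    eig = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return eig


def hutchinson_trace(grad_fn, w, samples=2000, seed=0):
    """Hutchinson estimator: the average of v^T H v over random
    Rademacher (+1/-1) vectors is an unbiased estimate of trace(H)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        v = [rng.choice((-1.0, 1.0)) for _ in w]
        hv = hvp(grad_fn, w, v)
        total += sum(a * b for a, b in zip(v, hv))
    return total / samples


# Toy loss 0.5 * w^T A w, whose Hessian is exactly A (hypothetical example).
A = [[2.0, 1.0], [1.0, 3.0]]


def toy_grad(w):
    """Gradient of the toy quadratic loss: A w."""
    return [sum(a * x for a, x in zip(row, w)) for row in A]


lam_max = top_eigenvalue(toy_grad, [0.5, -0.5])
trace_est = hutchinson_trace(toy_grad, [0.5, -0.5])
```

For this toy matrix the exact values are (5 + sqrt(5)) / 2 for the largest eigenvalue and 5 for the trace, so the estimates can be checked directly. In a real network, the finite-difference product would normally be replaced by an automatic-differentiation Hessian-vector product averaged over mini-batches.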
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
