Towards Understanding Mixture of Experts in Deep Learning
Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li

TL;DR
This paper provides a formal analysis of Mixture-of-Experts (MoE) layers in deep learning, explaining how they leverage cluster structures and non-linearity to improve learning and avoid collapse, with theoretical and empirical support.
Contribution
It offers the first formal theoretical understanding of how MoE layers enhance neural network performance and learn cluster-center features, supported by empirical results.
Findings
MoE improves learning by exploiting cluster structures.
With experts chosen as two-layer nonlinear CNNs, the MoE layer can learn a classification problem that is hard for a single expert to learn.
Routers learn cluster-center features to simplify tasks.
Abstract
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has achieved great success in deep learning. However, the understanding of such architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model will not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. To further understand this, we consider a challenging classification problem with intrinsic cluster structures, which is hard to learn using a single expert. Yet with the MoE layer, by choosing the experts as two-layer nonlinear convolutional neural networks (CNNs), we show that the problem can be learned successfully. Furthermore, our theory shows that the router can learn the…
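To make the architecture in the abstract concrete, here is a minimal sketch of a sparsely activated MoE layer: a router scores each expert, only the top-scoring expert runs, and each expert is a small two-layer nonlinear CNN over input patches. Several details are assumptions for illustration and are not taken from the text above: the linear router, top-1 selection, the cubic activation, the fixed averaging second layer, and all function names (`router_scores`, `expert_forward`, `moe_forward`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def router_scores(gates, patches):
    """Assumed linear router: expert m's score sums <gate_m, patch> over patches."""
    return [sum(dot(theta, p) for p in patches) for theta in gates]


def expert_forward(filters, patches, activation=lambda z: z ** 3):
    """Two-layer CNN expert: every filter is applied to every patch, passed
    through a nonlinearity (cubic here, by assumption), then averaged over
    filters by a fixed second layer."""
    total = sum(activation(dot(w, p)) for w in filters for p in patches)
    return total / len(filters)


def moe_forward(gates, experts, patches):
    """Top-1 routing: run only the highest-probability expert and scale its
    output by the router probability."""
    probs = softmax(router_scores(gates, patches))
    chosen = max(range(len(probs)), key=probs.__getitem__)
    return chosen, probs[chosen] * expert_forward(experts[chosen], patches)


if __name__ == "__main__":
    # Toy input: two 2-dimensional patches, two experts with one filter each.
    gates = [[1.0, 0.0], [0.0, 1.0]]
    experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
    patches = [[2.0, 0.0], [0.0, 1.0]]
    chosen, output = moe_forward(gates, experts, patches)
    print("expert chosen:", chosen, "output:", output)
```

Because only the selected expert is evaluated, compute per input stays close to that of a single expert while the layer as a whole holds several. In this sketch, a router whose gate vectors line up with cluster-specific directions sends each cluster to its own expert, which is the kind of behavior the summary attributes to the learned router.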
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning · Privacy-Preserving Technologies in Data
