On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions
Huy Nguyen, Xing Han, Carl Harris, Suchi Saria, Nhat Ho

TL;DR
This paper explores the use of Laplace gating functions in Hierarchical Mixture of Experts models, demonstrating theoretical benefits and empirical performance improvements over traditional Softmax gating in complex tasks.
Contribution
The paper introduces Laplace gating at both levels of the HMoE model and shows, theoretically and empirically, that it accelerates expert convergence and improves expert specialization relative to Softmax gating.
Findings
Laplace gating accelerates expert convergence.
Laplace gating enhances expert specialization.
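HMoE with Laplace gating outperforms Softmax-gated HMoE across diverse tasks, including large-scale multimodal tasks, image classification, and latent domain discovery.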
Abstract
With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating within the HMoE frameworks. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by the Softmax gating and, therefore, accelerates the expert convergence as well as enhances the expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and…
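To make the contrast between the two gating functions concrete, the following is a minimal sketch of a two-level HMoE forward pass in plain Python. It is not the authors' implementation. It assumes the Laplace gate normalizes exp(-||x - w_i||) over experts (a distance-based score), while the Softmax gate normalizes exp(<w_i, x>); this summary does not state either formula, so both are assumptions. The function names (`softmax_gate`, `laplace_gate`, `hmoe_output`) and the toy parameters are hypothetical.

```python
import math


def _normalize(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_gate(x, weights):
    """Assumed Softmax gate: score_i = <w_i, x> (inner product)."""
    return _normalize([sum(wi * xi for wi, xi in zip(w, x)) for w in weights])


def laplace_gate(x, weights):
    """Assumed Laplace gate: score_i = -||x - w_i||_2 (negative distance)."""
    return _normalize([-math.dist(x, w) for w in weights])


def hmoe_output(x, outer_w, inner_w, experts, gate):
    """Two-level mixture: an outer gate picks a group, an inner gate picks
    an expert within that group; the same gate type is used at both levels."""
    y = 0.0
    for i, p_group in enumerate(gate(x, outer_w)):
        for j, p_expert in enumerate(gate(x, inner_w[i])):
            y += p_group * p_expert * experts[i][j](x)
    return y


# Toy setup (all values are illustrative): 2 groups, 2 experts per group.
outer_w = [[1.0, 0.0], [0.0, 1.0]]
inner_w = [[[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 2.0]]]
experts = [
    [lambda x: x[0], lambda x: 2.0 * x[0]],
    [lambda x: x[1], lambda x: 2.0 * x[1]],
]

x = [1.0, 0.0]
y_laplace = hmoe_output(x, outer_w, inner_w, experts, laplace_gate)
y_softmax = hmoe_output(x, outer_w, inner_w, experts, softmax_gate)
```

Under this assumed form, the Laplace gate depends on the input only through its distance to each gating parameter, so the expert whose parameter lies closest to the input receives the largest weight.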
Taxonomy
Topics: Expert finding and Q&A systems · Mobile Crowdsensing and Crowdsourcing · Distributed Sensor Networks and Detection Algorithms
Methods: Softmax · Mixture of Experts
