Hierarchical Mixture-of-Experts with Two-Stage Optimization
Gleb Molodtsov, Alexander Miasnikov, Aleksandr Beznosikov

TL;DR
Hi-MoE introduces a hierarchical routing framework for sparse Mixture-of-Experts models that balances load across expert groups while promoting specialization within each group, improving performance and robustness on NLP and vision tasks.
Contribution
The paper proposes a hierarchical routing approach that decomposes expert assignment into inter-group balancing and intra-group specialization, with the aim of improving routing stability and model performance.
Findings
Consistent improvements over recent MoE baselines in NLP and vision.
In large-scale pre-training, Hi-MoE-7B reduces perplexity by 5.6%.
Achieves 40% better expert balance than OLMoE-7B.
Abstract
Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling…
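To make the two routing levels described in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract does not state the actual objectives, so everything below is an assumption made for illustration. The group score is taken as the log-sum-exp of a group's expert logits, the inter-group term is a Switch-Transformer-style balance loss applied at group level, and the intra-group term is simply the entropy of the within-group routing distribution. All function names (`route_token`, `inter_group_balance_loss`, `intra_group_entropy`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def route_token(logits, group_size):
    """Two-level top-1 routing for one token (illustrative assumption).

    Stage 1 picks a group using the log-sum-exp of its expert logits.
    Stage 2 picks an expert inside the chosen group.
    Returns (group index, expert index within group, group probs, expert probs).
    """
    groups = [logits[i:i + group_size] for i in range(0, len(logits), group_size)]
    group_probs = softmax([logsumexp(g) for g in groups])
    g = max(range(len(groups)), key=lambda i: group_probs[i])
    expert_probs = softmax(groups[g])
    e = max(range(group_size), key=lambda i: expert_probs[i])
    return g, e, group_probs, expert_probs


def inter_group_balance_loss(batch_logits, group_size):
    """Switch-style balance loss at group level (assumed form, not the paper's).

    Computes G * sum_g f_g * P_g, where f_g is the fraction of tokens routed
    to group g and P_g is the mean router probability assigned to group g.
    """
    n_tokens = len(batch_logits)
    n_groups = len(batch_logits[0]) // group_size
    counts = [0] * n_groups
    mean_prob = [0.0] * n_groups
    for logits in batch_logits:
        g, _, group_probs, _ = route_token(logits, group_size)
        counts[g] += 1
        for i, p in enumerate(group_probs):
            mean_prob[i] += p / n_tokens
    frac = [c / n_tokens for c in counts]
    return n_groups * sum(f * p for f, p in zip(frac, mean_prob))


def intra_group_entropy(batch_logits, group_size):
    """Mean entropy of the within-group routing distribution (assumed proxy).

    Lower values mean the router chooses more sharply among the experts
    of the selected group.
    """
    total = 0.0
    for logits in batch_logits:
        _, _, _, expert_probs = route_token(logits, group_size)
        total -= sum(p * math.log(p) for p in expert_probs if p > 0)
    return total / len(batch_logits)


if __name__ == "__main__":
    balanced = [[3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
    collapsed = [[3.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
    print(inter_group_balance_loss(balanced, group_size=2))
    print(inter_group_balance_loss(collapsed, group_size=2))
    print(intra_group_entropy(balanced, group_size=2))
```

With four experts in two groups, the balanced batch sends one token to each group and the inter-group term evaluates to 1 (up to floating-point error), while the collapsed batch sends both tokens to the first group and yields a value above 1. Under these assumptions, minimizing the first term spreads traffic across groups, and the entropy term gives one possible handle on how sharply the router separates experts inside a group.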
