Hierarchical Mixture-of-Experts with Two-Stage Optimization

Gleb Molodtsov; Alexander Miasnikov; Aleksandr Beznosikov

arXiv:2605.08292·cs.LG·May 12, 2026

TL;DR

Hi-MoE is a hierarchical routing framework for sparse Mixture-of-Experts models. It balances load across expert groups while promoting specialization within each group, which improves performance and robustness on NLP and vision tasks.

Contribution

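The paper proposes a hierarchical routing approach that decomposes expert assignment into two coupled levels: inter-group balancing, which enforces fair traffic across expert groups, and intra-group specialization, which promotes complementary experts while preventing collapse within a group. The stated effect is more stable routing and better performance.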
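To make the two-level decomposition concrete, here is a minimal sketch of routing a single token first to a group and then to experts inside that group. This page does not give the paper's routing rule, so the function name `route_token`, the argmax group choice, and the renormalized top-k gate weights are all illustrative assumptions rather than the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(group_logits, expert_logits, top_k=1):
    """Two-level routing for one token (hypothetical sketch, not the paper's rule).

    group_logits: one router score per expert group.
    expert_logits: expert_logits[g] holds one score per expert in group g.
    Returns (group index, [(expert index within group, gate weight), ...]).
    """
    # Level 1: pick the group with the highest router probability.
    group_probs = softmax(group_logits)
    g = max(range(len(group_probs)), key=group_probs.__getitem__)

    # Level 2: pick the top-k experts inside the chosen group.
    expert_probs = softmax(expert_logits[g])
    ranked = sorted(range(len(expert_probs)),
                    key=expert_probs.__getitem__, reverse=True)
    chosen = ranked[:top_k]

    # Renormalize so the selected gate weights sum to 1 (an assumed convention).
    norm = sum(expert_probs[e] for e in chosen)
    return g, [(e, expert_probs[e] / norm) for e in chosen]

if __name__ == "__main__":
    group, gates = route_token([0.1, 2.0], [[1.0, 0.0], [0.0, 3.0, 1.0]], top_k=2)
    print(group, gates)
```

Splitting the decision this way lets a balancing constraint act on the coarse group choice while the finer expert choice stays free to specialize, which is the separation the contribution statement describes.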

Findings

01  Consistent improvements over recent MoE baselines in NLP and vision.

02  In large-scale pre-training, Hi-MoE-7B reduces perplexity by 5.6%.

03  Achieves 40% better expert balance compared to OLMoE-7B.

Abstract

Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling…
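The two objectives named in the abstract can be illustrated with a small sketch. The exact losses are not given on this page, so everything below is an assumption: the group-level term borrows the form of the Switch Transformer auxiliary loss (number of groups times the sum over groups of token fraction times mean router probability), the within-group term is a mean pairwise cosine similarity between experts' gate profiles, and the weights `alpha` and `beta` are hypothetical hyperparameters.

```python
import math

def group_balance_loss(group_probs, group_choices, num_groups):
    """Assumed inter-group balance term, in the style of the Switch Transformer loss.

    group_probs: per-token router probabilities over groups.
    group_choices: per-token index of the selected group.
    Equals 1.0 when traffic and probability mass are uniform, and rises
    toward num_groups as traffic collapses onto a single group.
    """
    n = len(group_choices)
    frac = [0.0] * num_groups       # fraction of tokens sent to each group
    mean_prob = [0.0] * num_groups  # mean router probability per group
    for probs, g in zip(group_probs, group_choices):
        frac[g] += 1.0 / n
        for j in range(num_groups):
            mean_prob[j] += probs[j] / n
    return num_groups * sum(f * p for f, p in zip(frac, mean_prob))

def intra_group_similarity(expert_profiles):
    """Assumed intra-group term: mean pairwise cosine similarity.

    expert_profiles[e] holds expert e's gate values over a batch of tokens.
    Lower values mean the experts respond to different tokens.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

    k = len(expert_profiles)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0
    return sum(cosine(expert_profiles[i], expert_profiles[j])
               for i, j in pairs) / len(pairs)

def hierarchical_aux_loss(group_probs, group_choices, profiles_by_group,
                          alpha=0.01, beta=0.01):
    """Weighted sum of both terms; alpha and beta are hypothetical weights."""
    num_groups = len(profiles_by_group)
    balance = group_balance_loss(group_probs, group_choices, num_groups)
    similarity = sum(intra_group_similarity(p)
                     for p in profiles_by_group) / num_groups
    return alpha * balance + beta * similarity
```

In this sketch the balance term only sees group-level traffic, so it does not force individual experts toward uniform usage, while the similarity term only compares experts that share a group. That mirrors the trade-off the abstract describes, although the paper's actual objectives may differ.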
