Expert Divergence Learning for MoE-based Language Models
Jiaang Li, Haibin Chen, Langming Liu, Yujin Yuan, Yadao Wang, Yizhen Zhang, Chengting Yu, Xin Tong, Weidong Zhang, Shilei Liu, Wenbo Su, Bo Zheng

TL;DR
This paper introduces Expert Divergence Learning, a pre-training strategy for MoE language models that promotes expert specialization by maximizing divergence between routing distributions, leading to better performance and reduced homogenization.
Contribution
It proposes a novel auxiliary loss that encourages functional diversity among experts in MoE models using domain labels, improving specialization and downstream performance.
Findings
Models with Expert Divergence Learning outperform baselines on multiple benchmarks.
The method reduces expert homogenization and enhances functional specialization.
Training overhead remains negligible with the new divergence-based loss.
Abstract
The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE's full potential. To address this, we introduce Expert Divergence Learning, a novel pre-training strategy that explicitly encourages functional specialization among experts. Our method incorporates a label-driven auxiliary loss that leverages domain labels inherent in pre-training corpora to maximize the Jensen-Shannon Divergence between the expert routing distributions of different data domains. This optimization objective guides the model to develop diverged routing policies for varied domains and closer routing policies for the same domain, which leads to emergent and organized expert specialization. We validate our approach by pre-training MoE models of up to 15 billion…
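The label-driven objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the averaging of per-token router probabilities within each domain, and the use of the generalized (multi-distribution) Jensen-Shannon Divergence with uniform domain weights are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in nats along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def domain_routing_distributions(router_probs, domain_ids, num_domains):
    """Average the per-token routing probabilities within each domain.

    router_probs: (num_tokens, num_experts) array, each row sums to 1.
    domain_ids:   (num_tokens,) integer domain labels in [0, num_domains).
    Returns one routing distribution per domain present in the batch.
    """
    num_experts = router_probs.shape[1]
    sums = np.zeros((num_domains, num_experts))
    np.add.at(sums, domain_ids, router_probs)  # scatter-add rows by domain
    counts = np.bincount(domain_ids, minlength=num_domains)[:, None]
    present = counts[:, 0] > 0                 # skip domains absent from the batch
    return sums[present] / counts[present]

def expert_divergence_loss(router_probs, domain_ids, num_domains):
    """Negative generalized JSD across domain routing distributions.

    JSD = H(mean_d p_d) - mean_d H(p_d), with uniform domain weights
    (an assumption). Minimizing this loss maximizes the divergence.
    """
    per_domain = domain_routing_distributions(router_probs, domain_ids, num_domains)
    mixture = per_domain.mean(axis=0)
    jsd = entropy(mixture) - entropy(per_domain).mean()
    return -jsd

# Two domains routed to disjoint experts: the loss reaches its minimum, -log(2).
probs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])
loss = expert_divergence_loss(probs, labels, num_domains=2)
```

In a training loop, a term like this would presumably be added to the language-modeling loss with a small weighting coefficient; that coefficient and its value are not stated in this text.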
Peer Reviews
Decision: ICLR 2026 Poster
1. Specialization in MoEs is often underexplored, but it is highly important for modularity and effectiveness.
2. The idea is quite novel and very intuitive.
3. A clear trend in language modeling loss suggests that the proposed method carries strong potential.
1. I think the main weakness of the paper is the results, where there is no clear significant improvement in the majority of downstream tasks. More concretely, some items lead me to doubt the results:
a. Although the model size increases significantly (both total and active params), there is no monotonic increase in evals with larger model size (except ARC_e).
b. The improvement claimed for the proposed method is quite uneven, mainly concentrated on ARC_e.
c. Results are too close to d
The most important (albeit possibly weak, as discussed in weaknesses) strength of the method proposed in this paper is the experimental results of section 4.2, namely the fact that adding this loss does indeed improve consistently over training the same model without using this loss. In addition,
- The paper is very well written and presented.
- The idea is both simple and intuitive. It can probably be implemented in a few lines of code in any framework.
- The authors have covered all possible
Given that the idea is simple (which I consider a major strength), the weight falls on the experimental results to convey consistent improvement and, at the very least, that the method has the intended impact on the model. However, the downstream evaluations have several relatively worrying qualities. Firstly, the average scores are compared and the conclusion is drawn that 49 domains is better and that bigger models are more amenable to per-domain specialization. Looking closely at the scores though, there are
1. The paper gives a good overview of related work in routing strategies.
2. The experiments are performed over a variety of model sizes and multiple end-tasks are considered for evaluation. Performance is reported both in terms of loss/perplexity and end task performance.
3. The method is simple and clearly presented.
1. The authors do not show whether better results could be achieved with a different routing strategy. So far JSD is only tried on top of z-loss with alpha=1e-3. Is it the best alpha? Would JSD be impactful on top of a different alpha? Is JSD impactful when the routing model also has always-active FFNs? Is JSD impactful when one uses another way to promote uniform expert assignments, e.g. DeepSeek bias-based balancing https://arxiv.org/pdf/2408.15664?
2. It is not clear whether the domain cla
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Graph Neural Networks
