Load Balancing Mixture of Experts with Similarity Preserving Routers

Nabil Omi; Siddhartha Sen; Ali Farhadi

arXiv:2506.14038 · cs.LG · October 14, 2025

TL;DR

This paper introduces a novel load balancing loss for sparse Mixture of Experts models that preserves input similarity, leading to faster training convergence and reduced redundancy in expert utilization.

Contribution

The paper proposes a new load balancing loss that maintains relational structure among inputs, improving expert utilization and training efficiency in MoE models.
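One way to read "maintains relational structure among inputs" is that pairwise similarities between tokens should survive the router's linear map. The sketch below illustrates that idea under loud assumptions: this page does not state the loss, so the penalty form (squared Frobenius distance between the router's Gram matrix and the identity) and the name `orthogonality_penalty` are hypothetical, chosen only because a reviewer mentions orthogonality as angle preserving. A square orthogonal matrix preserves dot products exactly; a real router is usually rectangular, so preservation there would only be approximate.

```python
import numpy as np


def orthogonality_penalty(w):
    """Squared Frobenius distance between w.T @ w and the identity.

    Hypothetical illustration, not the paper's confirmed loss.
    w has shape (d_model, num_experts).
    """
    gram = w.T @ w
    return float(np.sum((gram - np.eye(w.shape[1])) ** 2))


rng = np.random.default_rng(0)

# A square orthogonal matrix obtained from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# Five toy "token" vectors and their pairwise dot products
# before and after the orthogonal map.
x = rng.normal(size=(5, 8))
sims_in = x @ x.T
sims_out = (x @ q) @ (x @ q).T

# An unconstrained random matrix for comparison.
w_random = rng.normal(size=(8, 8))
```

Here `sims_in` and `sims_out` agree up to floating-point error, and the penalty is near zero for `q` but large for `w_random`, which is the sense in which such a penalty would push a router toward preserving similarity.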

Findings

01. 36% faster convergence with the new loss

02. Lower redundancy in expert usage

03. Improved load balancing compared to existing methods

Abstract

Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters ("experts") for each input. A learned router computes a distribution over these experts and assigns each input token to a small subset of them. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage each token's distribution over experts to be roughly uniform. During training, this can lead to inconsistent routing behavior, causing the model to spend capacity learning redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar…
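The routing step and the conventional balancing mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `route_top_k` and `uniform_balance_loss` are hypothetical, and the auxiliary loss shown is the widely used Switch-Transformer-style term (number of experts times the sum over experts of dispatched-token fraction times mean router probability), assumed here as a stand-in for "most current load balancing mechanisms".

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route_top_k(tokens, router_weights, k):
    """Compute a distribution over experts and pick the top-k per token.

    tokens: (num_tokens, d_model); router_weights: (d_model, num_experts).
    Returns (probs, top_k) with shapes (num_tokens, num_experts)
    and (num_tokens, k).
    """
    probs = softmax(tokens @ router_weights, axis=-1)
    top_k = np.argsort(-probs, axis=-1)[:, :k]
    return probs, top_k


def uniform_balance_loss(probs, top_k):
    """Switch-style auxiliary loss (assumed baseline, see lead-in).

    Equals 1.0 when router probabilities are uniform and grows toward
    num_experts as routing collapses onto a single expert.
    """
    num_experts = probs.shape[1]
    counts = np.bincount(top_k.ravel(), minlength=num_experts)
    frac_tokens = counts / top_k.size   # share of assignments per expert
    mean_prob = probs.mean(axis=0)      # mean router probability per expert
    return num_experts * float(np.sum(frac_tokens * mean_prob))


# Toy usage: 16 tokens, model width 8, 4 experts, 2 experts per token.
rng = np.random.default_rng(0)
probs, top_k = route_top_k(rng.normal(size=(16, 8)),
                           rng.normal(size=(8, 4)), k=2)
aux = uniform_balance_loss(probs, top_k)
```

Because this term rewards uniform expert usage without looking at which tokens are similar, it is the kind of objective the abstract says can produce inconsistent routing.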

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The authors provide a simple and principled approach to regularizing an MoE router in order to promote expert regularization and mitigate the redundancies in traditional load balancing loss approaches. This appears to substantially speed up MoE training. The empirical evidence is thorough and convincing.

Weaknesses

Minor: The term load-balancing loss for SIMBAL seems slightly incorrect.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The idea is clear and intuitive. A code snippet is also provided in the appendix, showing the simplicity of the implementation. Moreover, the auxiliary loss hyperparameter is not sensitive to tuning, which makes this method easily integratable into existing architectures. The empirical gains are strong: training convergence is significantly faster with SIMBAL loss, and the final perplexity and benchmark scores are also better for MoE trained with SIMBAL. New metric for expert redundancy that q

Weaknesses

The authors mention that stronger benchmark performance is realized when training on significantly more data than the datasets used in the paper. If possible, it would be good to see some results on how the SIMBAL method performs when training on these larger datasets, compared to traditional MoE. The paper motivates orthogonality as angle preserving, but a more formal connection between router orthogonality and reduced redundancy / improved specialization (maybe via an analysis of routing vari

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The idea of "similar tokens should be routed similarly" to preserve semantic consistency, is interesting. However, the methodology and computation cost are questionable.

Weaknesses

Please see the question block.

Taxonomy

Topics: Data Mining Algorithms and Applications