Advancing Expert Specialization for Better MoE

Hongcan Guo; Haolang Lu; Guoshun Nan; Bolun Chu; Jialin Zhuang; Yuan Yang; Wenhao Che; Xinye Cao; Sicong Leng; Qimei Cui; and Xudong Jiang

arXiv:2505.22323 · cs.CL · January 27, 2026

Open Access

TL;DR

This paper introduces a simple method to improve expert specialization in Mixture-of-Experts (MoE) models by adding an orthogonality loss and a variance loss to training, reporting gains of up to 23.79% on benchmarks without architectural changes.

Contribution

It proposes two new objectives, an orthogonality loss and a variance loss, that enhance expert specialization in MoE models by countering the expert overlap and overly uniform routing caused by the auxiliary load balancing loss.

Findings

01 Up to 23.79% performance improvement on benchmarks

02 Enhanced expert specialization demonstrated across models

03 Maintains load balancing without extra architecture

Abstract

Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method…
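The two objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the orthogonality term is taken to be the mean squared cosine similarity between the outputs of different experts for one token, the variance term is taken to be the negative per-expert variance of routing probabilities over a batch of tokens, and the function names and the weights `beta` and `gamma` are hypothetical. It should not be read as the authors' implementation.

```python
import math
from itertools import combinations


def orthogonality_loss(expert_outputs):
    """Assumed form: mean squared cosine similarity between the output
    vectors of different experts for one token. It is 0 when all expert
    outputs are mutually orthogonal and close to 1 when they are parallel."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero

    pairs = list(combinations(expert_outputs, 2))
    return sum(cosine(u, v) ** 2 for u, v in pairs) / len(pairs)


def variance_loss(router_probs):
    """Assumed form: negative variance of each expert's routing probability
    across tokens, averaged over experts. Minimizing it pushes the router
    away from assigning every token the same scores."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    total = 0.0
    for e in range(num_experts):
        column = [router_probs[t][e] for t in range(num_tokens)]
        mean = sum(column) / num_tokens
        total += sum((p - mean) ** 2 for p in column) / num_tokens
    return -total / num_experts


def combined_loss(task_loss, aux_loss, orth_loss, var_loss,
                  alpha=0.01, beta=0.01, gamma=0.01):
    """Hypothetical weighting: the extra terms are added alongside the
    existing auxiliary load balancing loss rather than replacing it."""
    return task_loss + alpha * aux_loss + beta * orth_loss + gamma * var_loss
```

Under these assumptions, a router that gives every token identical scores yields a variance term of zero, while a router that sends different tokens to different experts yields a negative (lower) value. Likewise, two experts that produce the same output direction yield an orthogonality term near 1, and two that produce orthogonal outputs yield 0.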

Taxonomy

Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning

Methods: Mixture of Experts