Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

Rizhen Hu; Yuan Cao; Boao Kong; Mou Sun; Kun Yuan

arXiv:2602.14159·cs.LG·February 17, 2026

Open Access

TL;DR

This paper introduces two plug-and-play regularization losses for Mixture-of-Experts models that improve expert specialization and routing efficiency without architectural changes, leading to better performance and faster inference.

Contribution

The paper proposes intra- and cross-layer regularization losses that enhance MoE expert specialization and routing coherence without modifying existing architectures.

Findings

01. Improved expert specialization and routing efficiency.

02. Consistent performance gains across various benchmarks.

03. Faster inference due to more stable expert pathways.

Abstract

Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap, that is, redundant representations across experts and routing ambiguity, which leaves model capacity severely underutilized. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network…
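The two losses named in the abstract can be illustrated with a minimal per-token sketch. The abstract is truncated and does not give the exact formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the intra-layer term is taken to be the mean pairwise cosine similarity between the activations that different experts produce for the same token, and the cross-layer term is taken to be the negative sum of products of Top-k router probabilities in two adjacent layers. The paper's actual definitions may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def intra_layer_specialization_loss(expert_acts):
    """Assumed form: mean pairwise cosine similarity between the activation
    vectors that the activated experts produce for ONE token.
    expert_acts: list of vectors, one per activated expert."""
    n = len(expert_acts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    total = sum(cosine(expert_acts[i], expert_acts[j]) for i, j in pairs)
    return total / len(pairs)


def top_k_indices(probs, k):
    """Indices of the k largest routing probabilities."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def cross_layer_coupling_loss(probs_l, probs_next, k):
    """Assumed form: negative joint Top-k routing mass for ONE token across
    two adjacent layers, with the joint probability of an expert pair
    modeled as the product of the two router probabilities."""
    top_l = top_k_indices(probs_l, k)
    top_next = top_k_indices(probs_next, k)
    joint = sum(probs_l[i] * probs_next[j] for i in top_l for j in top_next)
    return -joint  # minimizing this maximizes the joint Top-k mass


# Identical expert activations are maximally penalized; orthogonal ones are not.
overlap = intra_layer_specialization_loss([[1.0, 0.0], [1.0, 0.0]])
disjoint = intra_layer_specialization_loss([[1.0, 0.0], [0.0, 1.0]])

# Confident routing in both layers yields a lower (better) loss than uniform routing.
peaked = cross_layer_coupling_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], k=2)
uniform = cross_layer_coupling_loss([1 / 3] * 3, [1 / 3] * 3, k=2)
```

In training, terms like these would typically be averaged over tokens and layers and added to the language-modeling objective with small weights; those weights and the averaging scheme are likewise assumptions here, not values taken from the paper.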

Taxonomy

Topics: Advanced Graph Neural Networks · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning