Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization
Rizhen Hu, Yuan Cao, Boao Kong, Mou Sun, Kun Yuan

TL;DR
This paper introduces two plug-and-play regularization losses for Mixture-of-Experts models that improve expert specialization and routing efficiency without architectural changes, leading to better performance and faster inference.
Contribution
It proposes novel intra- and cross-layer regularization losses that enhance MoE expert specialization and routing coherence without modifying existing architectures.
Findings
Improved expert specialization and routing efficiency.
Consistent performance gains across various benchmarks.
Faster inference due to more stable expert pathways.
Abstract
Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap -- redundant representations across experts and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-k routing probabilities across adjacent layers, establishing coherent expert pathways through network…
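The two losses named in the abstract can be illustrated with a small framework-free sketch. This is only one plausible reading of the description, not the authors' formulation: the abstract is truncated and gives no equations. The function names, the data layout (`activations[e][t]`, `probs_l[t]`), averaging over all expert pairs, and treating the joint routing probability of an expert pair as the product of the two layers' router probabilities are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def intra_layer_specialization_loss(activations):
    """Mean pairwise cosine similarity between experts on the same token.

    activations[e][t] is the (hypothetical) activation vector of expert e
    on token t. Lower values mean less overlap between experts.
    """
    num_tokens = len(activations[0])
    pairs = list(combinations(range(len(activations)), 2))
    total = 0.0
    for t in range(num_tokens):
        for i, j in pairs:
            total += cosine(activations[i][t], activations[j][t])
    return total / (num_tokens * len(pairs))


def cross_layer_coupling_loss(probs_l, probs_next, k):
    """Negative mean of the summed k largest joint routing probabilities.

    probs_l[t] and probs_next[t] are router softmax outputs for token t in
    two adjacent layers. The joint probability of the expert pair (i, j) is
    assumed to be the product p_i * q_j (an illustrative assumption).
    """
    total = 0.0
    for p, q in zip(probs_l, probs_next):
        joint = sorted((pi * qj for pi in p for qj in q), reverse=True)
        total += sum(joint[:k])
    return -total / len(probs_l)
```

Under these assumptions, experts that produce identical activations on a token score close to 1 on the first loss and orthogonal activations score 0, so minimizing it pushes experts apart. The second loss is more negative when both routers are confident about a small set of expert pairs, so minimizing it rewards peaked, consistent routing across adjacent layers.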
Taxonomy
Topics: Advanced Graph Neural Networks · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
