Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

TL;DR
This paper introduces an auxiliary loss called expert-router coupling (ERC) for Mixture-of-Experts models, which improves expert specialization and model performance by explicitly aligning router decisions with expert capabilities.
Contribution
We propose a novel, efficient auxiliary loss that tightly couples router decisions with expert capabilities, enhancing MoE training and interpretability.
Findings
ERC loss improves expert specialization in MoE models.
The method is computationally efficient, with fixed cost independent of batch size.
Pre-training on large models demonstrates effectiveness across multiple scales.
Abstract
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain intermediate activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding…
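The two constraints in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated here, so the hinge-style penalty, the multiplicative noise, the use of an L2 norm as the "activation" strength, and the names `activation_matrix`, `erc_loss`, `alpha`, and `noise` are all assumptions made for illustration. Each expert is reduced to a single weight matrix, and `M[i][j]` holds the activation norm that expert `j` produces on the perturbed proxy token `i`.

```python
import math
import random


def matvec(vec, mat):
    """Row vector (length d) times matrix (d x h) -> list of length h."""
    return [sum(v * row[k] for v, row in zip(vec, mat)) for k in range(len(mat[0]))]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def activation_matrix(router, expert_weights, noise=0.0, rng=None):
    """M[i][j] = norm of expert j's intermediate activation on proxy token i.

    router: n router embeddings (each of length d), one per expert.
    expert_weights: n matrices (d x h), a stand-in for each expert's first projection.
    noise: half-width of an assumed multiplicative perturbation of the proxy token.
    """
    rng = rng or random.Random(0)
    M = []
    for r in router:
        proxy = [x * (1.0 + rng.uniform(-noise, noise)) for x in r]
        M.append([l2_norm(matvec(proxy, W)) for W in expert_weights])
    return M


def erc_loss(M, alpha=1.0):
    """Assumed hinge form of the two constraints, averaged over n*n entries."""
    n = len(M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # (1) Expert i should respond more to its own proxy i than to proxy j.
            total += max(M[j][i] - alpha * M[i][i], 0.0)
            # (2) Proxy i should excite its own expert i more than expert j.
            total += max(M[i][j] - alpha * M[i][i], 0.0)
    return total / (n * n)


# Toy setup: two experts, two-dimensional router embeddings.
router = [[1.0, 0.0], [0.0, 1.0]]
aligned = [
    [[2.0, 0.0], [0.0, 0.5]],  # expert 0 responds strongly to dimension 0
    [[0.5, 0.0], [0.0, 2.0]],  # expert 1 responds strongly to dimension 1
]
swapped = [aligned[1], aligned[0]]  # router and experts deliberately mismatched

print(erc_loss(activation_matrix(router, aligned)))  # 0.0
print(erc_loss(activation_matrix(router, swapped)))  # 1.5
```

In this toy case the penalty is zero when each router embedding points at the expert that responds to it most strongly, and positive when the pairing is swapped. The cost depends only on the number of experts, not on the number of tokens in a batch, which is consistent with the efficiency claim in the findings above.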
Peer Reviews
Decision · ICLR 2026 Oral
The method is simple, intuitive, and clearly presented. The experiments and evaluations are thorough and the auxiliary loss does appear to improve performance.
The activation metric is not scale invariant: the auxiliary loss can be decreased in a non-meaningful manner simply by scaling up $W_g^i$. The auxiliary loss also appears to make the gradient dense across experts, since the activation norms for each token are computed across experts.
Many recent MoE papers focus on improving routing. Methods exist to resolve the mismatch between routing and experts, such as adding an auxiliary loss to teach desired properties in expert specialization, or modifying the model architecture, like AoE (which is also adopted as a baseline in this paper). The proposed method belongs to the former category; it achieves its goals regarding expert specialization by adding a simple constraint, without modifying the conventional model architecture…
The experiments only validate the method on a single, very small-scale model instance. It has not been demonstrated whether the method is effective across the wide variety of MoE architectures. Since the experiments involve expensive pre-training, it is understandable that validation on various settings had to be forgone due to cost, but the evidence provided still feels somewhat insufficient. The method includes randomness, which may be a source of training instability, although as…
1. The ERC loss is computationally cheap. 2. The experiments show that MoE models gain considerably from the ERC loss. 3. Extensive analysis and ablations are provided.
1. You might need to compare with the Router Orthogonalization Loss in https://yiyan.baidu.com/blog/publication/ERNIE_Technical_Report.pdf, since your loss is somewhat similar to $\|(RW_g)^\top RW_g - I\|_F$; if you assume $W_g^\top W_g \approx I$, it is similar to $\|R^\top R - I\|_F$. 2. It seems that this ERC loss can be optimized to 0 when $\mathrm{RMS}(R) \to 0$ or $\mathrm{RMS}(W_g) \to 0$, so will it only serve like weight decay?
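The reviewer's second point can be made concrete with a short derivation. It rests on assumptions not confirmed by the truncated abstract: that the activation strength is a norm $M_{ij} = \|\tilde{r}_i W_g^j\|$ of a linear map applied to the perturbed router embedding $\tilde{r}_i$, that the perturbation is multiplicative, and that the loss is a hinge penalty with a hypothetical margin factor $\alpha$:

$$
\mathcal{L}(R) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} \Big[ \max\big(M_{ij} - \alpha M_{ii},\, 0\big) + \max\big(M_{ji} - \alpha M_{ii},\, 0\big) \Big]
$$

Under these assumptions, scaling the router by any $c > 0$ scales every entry, $M_{ij}(cR) = c\, M_{ij}(R)$, and since $\max(cx, 0) = c \max(x, 0)$,

$$
\mathcal{L}(cR) = c\, \mathcal{L}(R) \;\longrightarrow\; 0 \quad \text{as } c \to 0 .
$$

The same argument applies when all $W_g$ are scaled by $c$. In this assumed form, the loss can therefore shrink without any change in which expert responds most strongly to which proxy token, which is the weight-decay-like behavior the reviewer asks about.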
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks
