Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv; Jin Ma; Yiyuan Ma; Siyuan Qiao

arXiv:2512.23447 · cs.CL · February 25, 2026

TL;DR

This paper introduces an auxiliary loss called expert-router coupling (ERC) for Mixture-of-Experts models, which improves expert specialization and model performance by explicitly aligning router decisions with expert capabilities.

Contribution

We propose a novel, efficient auxiliary loss that tightly couples router decisions with expert capabilities, enhancing MoE training and interpretability.

Findings

01

ERC loss improves expert specialization in MoE models.

02

The method is computationally efficient, with fixed cost independent of batch size.

03

Pre-training on large models demonstrates effectiveness across multiple scales.

Abstract

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain intermediate activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding…
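The abstract is cut off before it states the loss itself, so the following is only a minimal NumPy sketch of the two constraints as described, not the authors' implementation. The hinge form, the margin `alpha`, the multiplicative uniform perturbation controlled by `noise`, the use of an L2 norm as the "activation", and the `(n, d, h)` layout of `W_g` are all assumptions made for illustration.

```python
import numpy as np

def erc_loss(R, W_g, alpha=1.0, noise=0.0, rng=None):
    """Hinge-style sketch of an expert-router coupling penalty.

    R:     (n, d) router embeddings, one row per expert (the proxy tokens).
    W_g:   (n, d, h) one projection per expert, producing intermediate activations.
    alpha: hypothetical margin; noise: hypothetical perturbation half-width.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = R.shape[0]
    # Assumed perturbation: scale each coordinate by a factor in [1 - noise, 1 + noise].
    R_tilde = R * (1.0 + noise * rng.uniform(-1.0, 1.0, size=R.shape))
    # acts[i, j] is the activation vector of expert j on proxy token i.
    acts = np.einsum("id,jdh->ijh", R_tilde, W_g)
    M = np.linalg.norm(acts, axis=-1)  # M[i, j], shape (n, n)
    diag = np.diag(M)                  # M[i, i]: each expert on its own proxy
    # Constraint (1): expert j responds more to its own proxy j than to proxy i.
    c1 = np.maximum(M - alpha * diag[None, :], 0.0)
    # Constraint (2): proxy i elicits more from expert i than from expert j.
    c2 = np.maximum(M - alpha * diag[:, None], 0.0)
    off = ~np.eye(n, dtype=bool)       # only pairs with i != j are penalized
    return float((c1[off].sum() + c2[off].sum()) / n**2)

# Toy check: three experts, each responding only to its own proxy direction.
R = np.eye(3)
W = np.zeros((3, 3, 3))
for j in range(3):
    W[j, j, j] = 1.0
print(erc_loss(R, W))                      # 0.0: router and experts agree
print(erc_loss(R, np.roll(W, 1, axis=0)))  # about 0.667: experts shuffled
```

In this sketch the computation touches only the `n` router embeddings and the `n` expert projections, so its cost does not grow with the number of tokens in a batch, which is consistent with the fixed-cost finding above.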

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The method is simple, intuitive, and clearly presented. The experiments and evaluations are thorough and the auxiliary loss does appear to improve performance.

Weaknesses

The activation metric is not scale-invariant: the auxiliary loss can be decreased in a non-meaningful manner simply by scaling up $W_g^i$. The auxiliary loss also appears to make the gradient dense across experts, since activation norms for each token are computed across experts.
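One way to read the scale concern: if the activation is measured by a norm of a linear projection (an assumption here, since this excerpt does not define the metric), it is positively homogeneous in the weights:

$$\| x\,(c\,W_g^i) \| = |c|\,\| x\,W_g^i \| \quad \text{for any scalar } c,$$

so rescaling $W_g^i$ changes the magnitude of every activation that expert produces without changing which inputs it prefers. Whether this actually lowers the loss depends on its exact form, which is not given in this excerpt.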

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Many recent MoE papers focus on improving routing. Methods exist to resolve the mismatch between routing and experts, such as adding an auxiliary loss to teach desired properties in expert specialization, or modifying the model architecture, like AoE (which is also adopted as a baseline in this paper). The proposed method belongs to the former category; it achieves its goals regarding expert specialization by simply adding a simple constraint, without modifying the conventional model architectur

Weaknesses

The experiments only validate the method on a single, very small-scale model instance. It has not been demonstrated whether the method is effective across the wide variety of MoE architectures. Since the experiments involve expensive pre-training, it is understandable that validating on various settings must be forgone due to cost, but it is true that the information provided feels somewhat insufficient. The method includes randomness, which may be a source of training instability, although as

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The ERC loss is computationally cheap. 2. The experiments show MoE gets considerable gain from this ERC loss. 3. Much analysis and ablation are provided.

Weaknesses

1. You might need to compare with the Router Orthogonalization Loss in https://yiyan.baidu.com/blog/publication/ERNIE_Technical_Report.pdf, since your loss is somewhat similar to $\|(RW_g)^\top RW_g - I\|_F$; if you assume $W_g^\top W_g \approx I$, it is similar to $\|R^\top R - I\|_F$. 2. It seems that this ERC loss can be optimized to 0 when $\mathrm{RMS}(R) \to 0$ or $\mathrm{RMS}(W_g) \to 0$, so will it only serve like weight decay?
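For reference, the orthogonality-style penalty the reviewer writes as the Frobenius norm of R^T R - I can be sketched in a few lines. This is a hedged illustration of that formula only, not the ERNIE report's implementation. The function name `orthogonality_penalty` and the layout of `R` (one router embedding per column, so that R^T R is the expert-by-expert Gram matrix) are assumptions.

```python
import numpy as np

def orthogonality_penalty(R):
    """Frobenius distance between the Gram matrix R^T R and the identity.

    R: (d, n) array with one router embedding per column (assumed layout).
    """
    n = R.shape[1]
    gram = R.T @ R  # (n, n) pairwise inner products of router embeddings
    return float(np.linalg.norm(gram - np.eye(n), ord="fro"))

# Orthonormal columns give zero penalty; an all-zero R does not.
print(orthogonality_penalty(np.eye(4)[:, :3]))   # 0.0
print(orthogonality_penalty(np.zeros((4, 3))))   # sqrt(3), about 1.732
```

Under this layout the penalty is zero exactly when the columns are orthonormal, and it equals the square root of `n` when `R` is zero, so this particular penalty cannot be driven to zero by shrinking `R`.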

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks