Grouter: Decoupling Routing from Representation for Accelerated MoE Training

Yuqi Xu; Rizhen Hu; Zihan Liu; Mou Sun; Kun Yuan

arXiv:2603.06626 · cs.LG · March 10, 2026

Open Access

TL;DR

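Grouter is a preemptive routing approach for Mixture-of-Experts (MoE) models: a router distilled from a fully-trained MoE model is held fixed in the target model, so routing is decoupled from expert weight training. The paper reports faster convergence and higher training efficiency.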

Contribution

The paper proposes Grouter, a fixed routing method distilled from fully-trained MoE models, which enables faster and more stable MoE training.

Findings

01

Boosts pre-training data utilization by 4.28x

02

Achieves up to 33.5% throughput acceleration

03

Demonstrates superior performance and efficiency in experiments

Abstract

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to train expert weights while simultaneously searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully-trained MoE models and serves as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly improves both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural…
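The idea of a fixed router can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`top_k_route`, `moe_forward`), the top-k softmax gating, and the shapes are assumptions chosen for illustration, and the router weights here are random stand-ins for a router that would be distilled from a fully-trained MoE model. The sketch only shows the decoupling itself: when the router is frozen, updating expert weights changes the layer output but leaves token-to-expert assignments untouched.

```python
import numpy as np


def top_k_route(x, router_w, k):
    """Top-k softmax gating: returns (expert indices, gate weights) per token."""
    logits = x @ router_w                                  # (tokens, experts)
    top_idx = np.argsort(-logits, axis=1)[:, :k]           # (tokens, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    gates = np.exp(top_logits)
    gates = gates / gates.sum(axis=1, keepdims=True)       # normalize over the k picks
    return top_idx, gates


def moe_forward(x, router_w, expert_ws, k=2):
    """Route each token to k experts (linear maps here) and mix their outputs."""
    top_idx, gates = top_k_route(x, router_w, k)
    out = np.zeros((x.shape[0], expert_ws.shape[2]))
    for e in range(expert_ws.shape[0]):
        # Tokens that selected expert e, and the slot (0..k-1) where it appears.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size:
            weight = gates[token_ids, slot][:, None]
            out[token_ids] += weight * (x[token_ids] @ expert_ws[e])
    return out, top_idx


rng = np.random.default_rng(0)
tokens, d_model, n_experts = 8, 4, 4
x = rng.normal(size=(tokens, d_model))

# Assumption: random weights stand in for a router distilled from a trained model.
frozen_router_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(size=(n_experts, d_model, d_model))

out_before, assign_before = moe_forward(x, frozen_router_w, expert_ws)

# Simulated training step: only the expert weights move; the router is never updated.
expert_ws_updated = expert_ws - 0.1 * rng.normal(size=expert_ws.shape)
out_after, assign_after = moe_forward(x, frozen_router_w, expert_ws_updated)

routing_unchanged = np.array_equal(assign_before, assign_after)
print("routing unchanged after expert update:", routing_unchanged)
```

With a learned router, the same expert update would normally be accompanied by a router update, so assignments could shift from step to step; holding the router fixed removes that moving target.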

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis