Grouter: Decoupling Routing from Representation for Accelerated MoE Training
Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan

TL;DR
Grouter is a preemptive routing approach for Mixture-of-Experts (MoE) models that decouples routing from expert weight training, significantly improving convergence speed and training efficiency.
Contribution
The paper proposes Grouter, a fixed routing method distilled from fully-trained MoE models, enabling faster and more stable MoE training.
Findings
Boosts pre-training data utilization by 4.28x
Achieves up to 33.5% throughput acceleration
Demonstrates superior performance and efficiency in experiments
Abstract
Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully-trained MoE models and serves as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly improves both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural…
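To make the idea of a fixed router concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: it assumes top-k routing with softmax gating over the selected experts (the abstract does not specify the gating function), and the names `FixedRouter`, `moe_forward`, and `expert_scales` are hypothetical. The only point it illustrates is that the token-to-expert assignment comes from frozen router weights, so updating the expert parameters cannot change the routing.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class FixedRouter:
    """Hypothetical top-k router whose weights are frozen (assumption)."""

    def __init__(self, weights, top_k):
        # One row of weights per expert; stored as tuples so they are
        # immutable and can never be updated by a training step.
        self.weights = tuple(tuple(row) for row in weights)
        self.top_k = top_k

    def route(self, token):
        """Return [(expert_index, gate), ...] for the top-k experts."""
        logits = [sum(w * x for w, x in zip(row, token)) for row in self.weights]
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        top = ranked[: self.top_k]
        gates = softmax([logits[e] for e in top])
        return list(zip(top, gates))


def moe_forward(token, router, expert_scales):
    """Toy MoE layer: each expert just scales the token by its own factor."""
    out = [0.0] * len(token)
    for expert, gate in router.route(token):
        out = [o + gate * expert_scales[expert] * x for o, x in zip(out, token)]
    return out


# Router weights would come from a fully-trained model; these are made up.
router = FixedRouter(
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]], top_k=2
)
expert_scales = [1.0, 2.0, 3.0, 4.0]  # the only trainable parameters here
token = [2.0, 1.0]

before = router.route(token)
output = moe_forward(token, router, expert_scales)

# Simulate a training update that touches only the expert parameters.
expert_scales = [s * 0.5 for s in expert_scales]
after = router.route(token)
```

In this toy, `before` and `after` are identical because routing depends only on the frozen router weights, while the layer output changes as the experts are updated. In a real framework the same effect would typically be achieved by excluding the router parameters from the optimizer.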
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
