Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
Masahiro Kada, Ryota Yoshihashi, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

TL;DR
This paper introduces TGR-MoE, a teacher-guided routing method that stabilizes training and improves accuracy in sparse vision Mixture-of-Experts models by leveraging supervision from a pretrained teacher.
Contribution
The paper proposes teacher-guided routing for sparse MoE models, targeting the router's optimization difficulties to improve training stability and accuracy.
Findings
TGR-MoE improves accuracy on ImageNet-1K and CIFAR-100.
It stabilizes routing dynamics during training.
It maintains training stability under highly sparse configurations.
Abstract
Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE:…
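The gradient-blocking effect described in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation: it assumes one common gating variant, in which the softmax is taken over the top-k router logits only, and all names (`top_k_route`, `moe_output`, `finite_diff_grad`) and toy values are hypothetical. Under that assumption, the logits of unselected experts do not enter the forward computation, so the router receives exactly zero gradient for them.

```python
import math

def top_k_route(logits, k):
    """Pick the k largest router logits and softmax over those only (assumed gating variant)."""
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in selected)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in selected]
    total = sum(exps)
    gates = [e / total for e in exps]
    return selected, gates

def moe_output(logits, expert_outputs, k):
    """Gate-weighted sum of the selected experts' (scalar, toy) outputs."""
    selected, gates = top_k_route(logits, k)
    return sum(g * expert_outputs[i] for g, i in zip(gates, selected))

def finite_diff_grad(logits, expert_outputs, k, eps=1e-4):
    """Central finite-difference gradient of the output w.r.t. each router logit."""
    grads = []
    for j in range(len(logits)):
        up = list(logits)
        up[j] += eps
        down = list(logits)
        down[j] -= eps
        diff = moe_output(up, expert_outputs, k) - moe_output(down, expert_outputs, k)
        grads.append(diff / (2 * eps))
    return grads

# Toy values (hypothetical): 4 experts, top-2 routing selects experts 0 and 2.
logits = [2.0, 0.5, 1.0, -1.0]
expert_outputs = [1.0, 3.0, -2.0, 5.0]

selected, gates = top_k_route(logits, k=2)
grads = finite_diff_grad(logits, expert_outputs, k=2)

print("selected experts:", selected)
print("gradient per router logit:", grads)
```

In this toy setting the router gets no signal about experts 1 and 3, even though expert 3 has the largest output; only the logits of the two selected experts receive a nonzero gradient. Variants that apply the softmax over all logits before selecting the top-k couple the logits through the normalization term, but the unselected experts' outputs still never reach the loss.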
