Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

Masahiro Kada; Ryota Yoshihashi; Satoshi Ikehata; Rei Kawakami; Ikuro Sato

arXiv:2604.21330 · cs.CV · April 24, 2026

TL;DR

This paper introduces TGR-MoE, a teacher-guided routing method that stabilizes training and improves accuracy in sparse vision Mixture-of-Experts models by leveraging supervision from a pretrained teacher.

Contribution

The paper proposes a teacher-guided routing approach for sparse MoE models that addresses the optimization difficulties of router training and improves training stability and accuracy.

Findings

01. TGR-MoE improves accuracy on ImageNet-1K and CIFAR-100.

02. It stabilizes routing dynamics during training.

03. It maintains training stability under highly sparse configurations.

Abstract

Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE:…
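The gradient-blocking effect described in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not show the teacher-guided part, which the truncated abstract does not describe. It assumes a generic top-k router whose gates are a softmax renormalized over the selected experts (one common choice), uses scalar stand-ins for expert outputs, and estimates the router gradient by finite differences. All function names and values are hypothetical.

```python
import math

def top_k_route(logits, k):
    """Pick the k largest router logits and softmax-normalize over them only.

    Assumption: gates are renormalized over the selected experts; other
    sparse MoE variants gate differently.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_output(logits, expert_outputs, k):
    """Gate-weighted sum over the selected experts; unselected experts are skipped."""
    gates = top_k_route(logits, k)
    return sum(g * expert_outputs[i] for i, g in gates.items())

def router_grad(logits, expert_outputs, k, eps=1e-6):
    """Central finite-difference gradient of the output w.r.t. each router logit."""
    grads = []
    for j in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[j] += eps
        dn[j] -= eps
        diff = moe_output(up, expert_outputs, k) - moe_output(dn, expert_outputs, k)
        grads.append(diff / (2 * eps))
    return grads

# Hypothetical values: four experts, two active per input.
logits = [2.0, 1.0, 0.1, -1.0]
expert_outputs = [1.0, -2.0, 3.0, 0.5]  # scalar stand-ins for expert outputs
grads = router_grad(logits, expert_outputs, k=2)

for j, g in enumerate(grads):
    print(f"d(output)/d(logit[{j}]) = {g:+.4f}")
```

In this sketch the logits of the two unselected experts receive a gradient of exactly zero, so the router gets no signal about whether routing to them would have been better. That is the localized feedback the abstract refers to.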
