Improving Routing in Sparse Mixture of Experts with Graph of Tokens
Tam Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk

TL;DR
This paper introduces a novel approach to improve routing stability in Sparse Mixture of Experts models by leveraging token similarities and attention mechanisms, resulting in more robust and accurate models.
Contribution
The paper proposes Similarity-Aware and Attention-Aware (S)MoE models, which incorporate token interactions into expert selection to reduce routing fluctuations and improve the robustness of SMoE architectures.
Findings
Significant reduction in routing fluctuations.
Enhanced model accuracy across tasks.
Increased robustness compared to baseline MoE-Transformer.
Abstract
Sparse Mixture of Experts (SMoE) has emerged as a key to achieving unprecedented scalability in deep learning. By activating only a small subset of parameters per sample, SMoE achieves an exponential increase in parameter counts while maintaining a constant computational overhead. However, SMoE models are susceptible to routing fluctuations (changes in the routing of a given input to its target expert) at the late stage of model training, leading to model non-robustness. In this work, we unveil the limitation of SMoE through the perspective of the probabilistic graphical model (PGM). Through this PGM framework, we highlight the independence in the expert-selection of tokens, which exposes the model to routing fluctuation and non-robustness. Alleviating this independence, we propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We…
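Illustrative sketch
The abstract contrasts independent per-token expert selection with selection that takes token interactions into account. The following toy Python example is one way to picture that difference; it is not the authors' formulation. The function names, the dot-product router, the softmax token-similarity weights, the `temperature` parameter, and the sample data are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def router_scores(tokens, expert_embeddings):
    """Per-token distribution over experts from a dot-product router (assumed)."""
    return [softmax([dot(t, e) for e in expert_embeddings]) for t in tokens]


def independent_routing(tokens, expert_embeddings):
    """Baseline: each token picks its top-1 expert from its own scores only."""
    scores = router_scores(tokens, expert_embeddings)
    return [max(range(len(s)), key=s.__getitem__) for s in scores]


def similarity_aware_routing(tokens, expert_embeddings, temperature=1.0):
    """Hypothetical variant: each token's expert scores are replaced by a
    similarity-weighted average of the scores of all tokens in the sequence."""
    scores = router_scores(tokens, expert_embeddings)
    n_experts = len(expert_embeddings)
    choices = []
    for t in tokens:
        # Token-to-token weights: softmax over dot-product similarities.
        weights = softmax([dot(t, u) / temperature for u in tokens])
        mixed = [
            sum(w * s[k] for w, s in zip(weights, scores))
            for k in range(n_experts)
        ]
        choices.append(max(range(n_experts), key=mixed.__getitem__))
    return choices


# Toy data (made up): two experts, and a third token that sits near the
# decision boundary while remaining similar to the first two tokens.
EXPERTS = [[1.0, 0.0], [0.0, 1.0]]
TOKENS = [[2.0, 0.0], [2.0, 0.1], [1.0, 1.1]]

print(independent_routing(TOKENS, EXPERTS))       # [0, 0, 1]
print(similarity_aware_routing(TOKENS, EXPERTS))  # [0, 0, 0]
```

In this toy setting the borderline third token is routed to expert 1 when it decides alone, but follows its similar neighbors to expert 0 once their scores are mixed in. That is the intuition behind letting tokens influence one another's expert choice; the paper's actual models may differ in every detail.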
Taxonomy
Topics: Advanced Clustering Algorithms Research · Complex Network Analysis Techniques · Expert finding and Q&A systems
