Improving MoE Compute Efficiency by Composing Weight and Data Sparsity
Maciej Kilian, Oleg Mkrtchyan, Luke Zettlemoyer, Akshat Shrivastava, Armen Aghajanyan

TL;DR
This paper introduces a method to improve compute efficiency in Mixture-of-Experts (MoE) models by combining weight sparsity with data sparsity. Null experts in the routing pool preserve causality, and the combined approach achieves lower training loss and better downstream performance at matched FLOPs.
Contribution
It proposes a novel approach to incorporate data sparsity within causal MoE models using null experts, enhancing compute efficiency without violating causality.
Findings
Composing weight and data sparsity improves compute efficiency.
The model learns implicit modality-aware routing, sending vision tokens to null experts more readily than text tokens.
Achieves better training loss and downstream performance at matched FLOPs.
Abstract
Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating a train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to use all experts (real and null) uniformly, thereby creating data sparsity in expectation without the causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are…
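The routing mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (route_with_null_experts, moe_forward, expected_real_fraction), the softmax over the selected logits, the choice to leave gate weights unrenormalized when a slot lands on a null expert, and the linear "experts" are all assumptions made for illustration. The sketch shows only that each token picks its own top-k experts (so routing stays causal) and that slots assigned to null experts contribute no output and trigger no expert computation.

```python
import numpy as np

def route_with_null_experts(logits, n_real, top_k):
    """Token-choice top-k routing over a pool of real and null experts.

    logits: (n_tokens, n_real + n_null); columns >= n_real are null experts.
    Returns (idx, weights, real_mask), each of shape (n_tokens, top_k).
    """
    # Each token selects its own top-k experts, independent of other tokens.
    idx = np.argsort(-logits, axis=-1)[:, :top_k]
    picked = np.take_along_axis(logits, idx, axis=-1)
    # Softmax over the selected logits (an assumed gating choice).
    picked = picked - picked.max(axis=-1, keepdims=True)
    weights = np.exp(picked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # True where the slot points at a real expert, False for a null expert.
    real_mask = idx < n_real
    return idx, weights, real_mask

def moe_forward(x, router_w, expert_ws, top_k):
    """Apply real experts only; null slots are skipped and add nothing."""
    n_real = len(expert_ws)
    logits = x @ router_w
    idx, weights, real_mask = route_with_null_experts(logits, n_real, top_k)
    out = np.zeros_like(x)
    for e in range(n_real):
        # Tokens that chose expert e, and the slot in which they chose it.
        tok, slot = np.nonzero(idx == e)
        if tok.size == 0:
            continue
        # Hypothetical expert: a single linear map, for illustration only.
        out[tok] += weights[tok, slot][:, None] * (x[tok] @ expert_ws[e])
    return out, real_mask

def expected_real_fraction(n_real, n_null):
    """Fraction of routing slots that hit real experts if load balancing
    were perfectly uniform over the whole pool (an idealized assumption)."""
    return n_real / (n_real + n_null)

# Small demo with random, untrained parameters.
rng = np.random.default_rng(0)
n_tokens, d, n_real, n_null, top_k = 512, 16, 4, 4, 2
x = rng.normal(size=(n_tokens, d))
router_w = rng.normal(size=(d, n_real + n_null))
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_real)]
out, real_mask = moe_forward(x, router_w, expert_ws, top_k)
print("observed fraction of real slots:", real_mask.mean())
print("fraction under uniform load:", expected_real_fraction(n_real, n_null))
```

Under the idealized assumption of perfectly uniform expert usage, a token with top_k slots would activate top_k * n_real / (n_real + n_null) real experts on average, which is how adding null experts lowers expected compute per token. With an untrained random router, as in the demo, the observed fraction only approximates that value.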
Taxonomy
Topics: Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing
