Robustness of Mixtures of Experts to Feature Noise
Dong Sun, Rahul Nittala, Rebekka Burkholz

TL;DR
This paper investigates why Mixture of Experts (MoE) models outperform dense networks under feature noise. It shows that sparse expert activation filters noise, which yields better robustness, faster convergence, and improved generalization. Both theory and experiments support the claim.
Contribution
The paper provides a theoretical and empirical analysis showing that sparse expert activation in MoEs enhances robustness to feature noise and improves learning efficiency.
Findings
MoEs achieve lower generalization error under feature noise
Sparse activation acts as an effective noise filter
Empirical results confirm robustness and efficiency gains
Abstract
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
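The noise-filtering intuition in the abstract can be illustrated with a small toy simulation. This is only a sketch under assumptions that do not come from the paper: inputs are split into disjoint blocks with exactly one active block per sample, each expert is a fixed linear map on its own block, the "dense" estimator sums all expert outputs, and the router picks the block with the largest energy. The function name `simulate` and all parameter values are hypothetical.

```python
import numpy as np


def simulate(num_experts=8, block_dim=4, n=5000, sigma=0.3, seed=0):
    """Toy comparison of a dense sum vs. top-1 routing under feature noise.

    Assumed setup (not the paper's): one active block per sample, one
    linear expert per block, additive Gaussian noise on every feature.
    Returns (dense_mse, moe_mse).
    """
    rng = np.random.default_rng(seed)
    rows = np.arange(n)

    # One linear expert per block; weights are fixed, not trained.
    W = rng.normal(size=(num_experts, block_dim))

    # Latent modular structure: only one block carries signal per sample.
    active = rng.integers(num_experts, size=n)
    X = np.zeros((n, num_experts, block_dim))
    X[rows, active] = 3.0 * rng.normal(size=(n, block_dim))

    # Clean target: output of the expert that owns the active block.
    y = np.einsum("nkd,kd->n", X, W)

    # Feature noise corrupts every block, including the inactive ones.
    X_noisy = X + sigma * rng.normal(size=X.shape)
    per_expert = np.einsum("nkd,kd->nk", X_noisy, W)

    # Dense estimator: every expert contributes, so noise from all
    # blocks reaches the output.
    dense_pred = per_expert.sum(axis=1)

    # Sparse estimator: route to the block with the largest energy and
    # discard the outputs (and the noise) of all other experts.
    routed = np.argmax(np.square(X_noisy).sum(axis=2), axis=1)
    moe_pred = per_expert[rows, routed]

    dense_mse = float(np.mean((dense_pred - y) ** 2))
    moe_mse = float(np.mean((moe_pred - y) ** 2))
    return dense_mse, moe_mse


if __name__ == "__main__":
    dense_mse, moe_mse = simulate()
    print(f"dense MSE: {dense_mse:.3f}  top-1 MoE MSE: {moe_mse:.3f}")
```

In this toy setting the dense prediction accumulates noise from every block, while the routed prediction only sees noise from the selected block, so its error is expected to be smaller whenever routing is mostly correct. The sketch says nothing about trained models, convergence speed, or the paper's actual theoretical regime.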
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques
