Robustness of Mixtures of Experts to Feature Noise

Dong Sun; Rahul Nittala; Rebekka Burkholz

arXiv:2601.14792·cs.LG·January 22, 2026

Open Access

TL;DR

This paper investigates why Mixture of Experts (MoE) models can outperform dense networks under feature noise. It shows that sparse expert activation filters noise, which yields lower generalization error, better robustness to perturbations, and faster convergence, with support from both theory and experiments.

Contribution

The paper provides a theoretical and empirical analysis showing that sparse expert activation in MoEs improves robustness to feature noise and speeds up convergence.

Findings

01 MoEs achieve lower generalization error under feature noise

02 Sparse activation acts as an effective noise filter

03 Empirical results confirm robustness and efficiency gains

Abstract

Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
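The noise-filter intuition in the abstract can be illustrated with a toy sketch. This is not the paper's model or experimental setup: the block-structured input, the all-ones weight vector, the oracle routing to the correct expert, and the names `dense_output`, `sparse_output`, and `noise_mse` are all assumptions made for illustration. The sketch compares how much i.i.d. Gaussian feature noise moves the output of a dense linear estimator versus one that reads only the routed expert's block of features.

```python
import numpy as np

def dense_output(x, w):
    # Dense linear estimator: every feature (and its noise) enters the sum.
    return x @ w

def sparse_output(x, w, expert, block_size):
    # Sparse estimator with oracle routing (an assumption of this sketch):
    # only the routed expert's block of features is read.
    block = slice(expert * block_size, (expert + 1) * block_size)
    return x[..., block] @ w[block]

def noise_mse(num_experts=4, block_size=8, sigma=0.5, trials=20000, seed=0):
    """Mean squared output deviation caused by feature noise, dense vs. sparse."""
    rng = np.random.default_rng(seed)
    dim = num_experts * block_size
    w = np.ones(dim)   # hypothetical fixed weights
    expert = 1         # hypothetical active module

    # Clean input: signal lives only in the active expert's block.
    x_clean = np.zeros(dim)
    x_clean[expert * block_size:(expert + 1) * block_size] = 1.0

    # Feature noise is added to every coordinate, active or not.
    x_noisy = x_clean + sigma * rng.normal(size=(trials, dim))

    dense_err = dense_output(x_noisy, w) - dense_output(x_clean, w)
    sparse_err = (sparse_output(x_noisy, w, expert, block_size)
                  - sparse_output(x_clean, w, expert, block_size))
    return float(np.mean(dense_err ** 2)), float(np.mean(sparse_err ** 2))

dense_mse, sparse_mse = noise_mse()
print(f"dense MSE:  {dense_mse:.3f}")
print(f"sparse MSE: {sparse_mse:.3f}")
```

Under these assumptions the dense deviation has variance sigma^2 times the full dimension, while the sparse deviation has variance sigma^2 times the block size, so the gap grows with the number of inactive blocks. A learned router that sometimes picks the wrong expert would narrow this gap; the sketch does not model that.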

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques