Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity

Jinze Zhao; Peihao Wang; Junjie Yang; Ruisi Cai; Gaowen Liu; Jayanth Srinivasa; Ramana Rao Kompella; Yingbin Liang; Zhangyang Wang

arXiv:2410.13964 · cs.LG · June 17, 2025

Open Access

TL;DR

This paper investigates how the sparsity level in Sparse Mixture-of-Experts models affects their ability to generalize compositionally, providing empirical evidence and theoretical insights into optimal expert activation based on task complexity.

Contribution

The paper offers a theoretical scaling law for optimal sparsity in SMoE models and shows empirically that the number of activated experts needed to maintain performance grows with task difficulty and complexity.

Findings

01. Optimal expert activation increases with task complexity.

02. Theoretical scaling law aligns with empirical results.

03. Optimal sparsity balances approximation and estimation errors.

Abstract

Sparse Mixture-of-Experts (SMoE) architectures have gained prominence for their ability to scale neural networks, particularly transformers, without a proportional increase in computational cost. Despite their success, their role in compositional generalization, i.e., adapting to novel combinations of known components, remains under-explored. This study challenges the assumption that minimal expert activation suffices for task generalization and investigates the relationship between task complexity and optimal sparsity in SMoE models. Through empirical evaluations on the SRAVEN symbolic reasoning task and the SKILL-MIX benchmark, we demonstrate that (i) the number of activated experts consistently increases with the perceived task difficulty to maintain performance; and (ii) the optimal number of activated experts scales proportionally with task complexity. Our theoretical analysis…
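
To make "number of activated experts" concrete, here is a minimal sketch of top-k routing in a sparse MoE layer. It is not the paper's implementation: the linear experts, the softmax renormalized over the selected experts, and all names (`top_k_moe`, `gate_w`, `expert_ws`) are assumptions chosen for illustration. The parameter `k` is the sparsity knob the abstract refers to.

```python
import numpy as np

def top_k_moe(x, gate_w, expert_ws, k):
    """Route one input through the k highest-scoring experts.

    x:         (d,) input vector
    gate_w:    (n_experts, d) router weights (hypothetical linear router)
    expert_ws: (n_experts, d_out, d) one linear map per expert (assumption)
    k:         number of experts activated per input
    """
    logits = gate_w @ x                       # one routing score per expert
    top = np.argsort(logits)[-k:]             # indices of the k largest scores
    z = logits[top] - logits[top].max()       # shift for numerical stability
    weights = np.exp(z) / np.exp(z).sum()     # softmax over selected experts only
    # Weighted sum of the selected experts' outputs; other experts are skipped.
    out = sum(w * (expert_ws[i] @ x) for w, i in zip(weights, top))
    return out, top, weights

# Illustrative usage with random parameters (8 experts, 2 active).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
gate_w = rng.normal(size=(8, 4))
expert_ws = rng.normal(size=(8, 3, 4))
out, active, weights = top_k_moe(x, gate_w, expert_ws, k=2)
```

With `k=1` the layer reduces to a single expert per input; with `k` equal to the number of experts it becomes a dense softmax mixture. The paper's question is where between these extremes the best `k` lies for a given task complexity.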

Taxonomy

Topics: Expert finding and Q&A systems

Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout