Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

Huy Nguyen; Nhat Ho; Alessandro Rinaldo

arXiv:2405.13997 · stat.ML · November 5, 2024

TL;DR

This paper provides a theoretical analysis showing that sigmoid gating in mixture of experts models is more sample efficient and converges faster than softmax gating, especially when using common neural network activations.

Contribution

It offers a rigorous theoretical comparison of sigmoid versus softmax gating, demonstrating the superior sample efficiency of sigmoid gating in expert estimation.

Findings

1. Sigmoid gating achieves faster convergence rates than softmax gating.
2. Sigmoid gating requires fewer samples to reach the same estimation error.
3. Theoretical analysis confirms empirical advantages of sigmoid gating.

Abstract

The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in current literature. In this paper, we verify theoretically that the sigmoid gating, in fact, enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of…
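The structural difference between the two gating functions can be illustrated with a minimal sketch. It is not the paper's estimator or experimental setup: the raw gating scores, the helper names (`softmax_gate`, `sigmoid_gate`, `moe_output`), and the scalar expert outputs are all hypothetical, chosen only to show that softmax weights are coupled through a shared normalizer while each sigmoid weight depends on its own score alone.

```python
import math

def softmax_gate(scores):
    """Softmax gating: weights share one normalizer and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_gate(scores):
    """Sigmoid gating: each weight is computed from its own score only."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def moe_output(gate_weights, expert_outputs):
    """Gate-weighted sum of (hypothetical) scalar expert outputs."""
    return sum(w * y for w, y in zip(gate_weights, expert_outputs))

# Hypothetical gating scores for three experts; in the second list
# only the score of expert 1 has been raised.
scores = [2.0, 1.0, -1.0]
raised = [2.0, 3.0, -1.0]

soft_before, soft_after = softmax_gate(scores), softmax_gate(raised)
sig_before, sig_after = sigmoid_gate(scores), sigmoid_gate(raised)

# Under softmax, raising expert 1's score lowers the weights of experts 0 and 2.
print("softmax:", soft_before, "->", soft_after)
# Under sigmoid, only expert 1's weight changes.
print("sigmoid:", sig_before, "->", sig_after)

# Hypothetical expert outputs, combined under each gate.
experts = [0.5, -1.0, 2.0]
print("softmax mixture:", moe_output(soft_before, experts))
print("sigmoid mixture:", moe_output(sig_before, experts))
```

In this toy setting, the coupling in the softmax weights is one way to picture the competition among experts that the abstract mentions; the sigmoid weights need not sum to one.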

Videos

Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts · SlidesLive

Taxonomy

Topics: Forecasting Techniques and Applications

Methods: Softmax