Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

Huy Nguyen; Pedram Akbarian; Fanqi Yan; Nhat Ho

arXiv:2309.13850 · stat.ML · February 27, 2024 · 1 citation

TL;DR

This paper provides a theoretical analysis of the top-K sparse softmax gating mixture of experts, revealing how it affects density and parameter estimation, especially in over-specified models, with convergence rates depending on expert number and input region complexity.

Contribution

It offers the first theoretical insights into the behavior of top-K sparse softmax gating in mixture of experts, including convergence rates and conditions for accurate estimation.

Findings

01. Density estimation converges at a parametric rate when the true number of experts is known.

02. Over-specified models require selecting more experts than the true number to ensure density estimation convergence.

03. Parameter estimation rates slow down significantly in over-specified models due to gating-expert interactions.

Abstract

Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{*}$ is known, we demonstrate that the convergence rates of density and parameter…
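To make the gating mechanism described in the abstract concrete, here is a minimal sketch in plain Python. It assumes affine gating logits of the form `beta1[i]·x + beta0[i]`; the function names (`top_k_sparse_softmax`, `gate`, `active_experts`) and the parameter values are hypothetical and chosen only for illustration, not taken from the paper. The sketch shows how keeping only the K largest logits makes the set of active experts change from one input region to another.

```python
import math


def top_k_sparse_softmax(logits, k):
    """Softmax restricted to the k largest logits; all other weights are 0."""
    n = len(logits)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (the sort is stable, so ties keep the lower index).
    top = sorted(range(n), key=lambda i: -logits[i])[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(n)]


def gate(x, beta1, beta0, k):
    """Gating weights for input x, assuming affine logits (an illustrative choice)."""
    logits = [
        sum(w * xi for w, xi in zip(b1, x)) + b0
        for b1, b0 in zip(beta1, beta0)
    ]
    return top_k_sparse_softmax(logits, k)


def active_experts(weights):
    """Indices of experts that receive non-zero gating weight."""
    return [i for i, w in enumerate(weights) if w > 0.0]


# Hypothetical gating parameters: three experts, one-dimensional input.
beta1 = [[1.0], [-1.0], [0.0]]
beta0 = [0.0, 0.0, 0.5]

w_pos = gate([2.0], beta1, beta0, k=2)   # input in the "positive" region
w_neg = gate([-2.0], beta1, beta0, k=2)  # input in the "negative" region

print(active_experts(w_pos), w_pos)
print(active_experts(w_neg), w_neg)
```

In this toy setup the two inputs activate different pairs of experts, which is the region-dependent behavior the abstract identifies as the main obstacle to analysis.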

Videos

Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques

Methods: Softmax