Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention
Pedram Akbarian, Huy Nguyen, Xing Han, Nhat Ho

TL;DR
This paper reveals a fundamental link between mixture of experts and self-attention, providing theoretical insights into gating functions and proposing an active-attention mechanism that improves performance across multiple tasks.
Contribution
It establishes a rigorous relation between MoE and self-attention, analyzes quadratic gating functions, and introduces active-attention with empirical validation.
Findings
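The quadratic monomial gate improves sample efficiency for estimating parameters and experts compared to the quadratic polynomial gate.
Non-linear experts lead to faster parameter and expert estimation rates.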
Active-attention outperforms standard self-attention in experiments.
Abstract
Mixture of Experts (MoE) models are well known for effectively scaling model capacity while keeping computational overhead in check. In this paper, we establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. Motivated by this connection, we conduct a comprehensive convergence analysis of MoE models with two different quadratic gating functions, namely the quadratic polynomial gate and the quadratic monomial gate, offering useful insights into the design of gating and experts for the MoE framework. First, our analysis indicates that the quadratic monomial gate yields improved sample efficiency for estimating parameters and experts compared to the quadratic polynomial gate. Second, parameter and expert estimation rates become significantly faster…
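The claim that each row of a self-attention matrix is a quadratic gating mixture of linear experts can be checked numerically. The sketch below is a minimal illustration in standard scaled dot-product notation, not the paper's exact formulation: the names `X`, `W_q`, `W_k`, `W_v`, and the helper `attention_row_as_moe` are assumptions introduced here. For token i, the gate logits are the bilinear forms x_i^T A x_j with A = W_q W_k^T / sqrt(d_k), which are quadratic in the input tokens, and expert j is the linear map x_j -> W_v^T x_j.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Standard single-head scaled dot-product self-attention.
    d_k = W_q.shape[1]
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)
    return softmax(scores) @ (X @ W_v)

def attention_row_as_moe(X, W_q, W_k, W_v, i):
    # Hypothetical helper: rebuild row i of the attention output as a
    # softmax-gated mixture whose logits are quadratic in the tokens.
    d_k = W_q.shape[1]
    A = W_q @ W_k.T / np.sqrt(d_k)  # (d, d) matrix of the quadratic form
    n_tokens = X.shape[0]
    logits = np.array([X[i] @ A @ X[j] for j in range(n_tokens)])
    gate = softmax(logits)  # mixture weights over the n_tokens experts
    experts = np.stack([W_v.T @ X[j] for j in range(n_tokens)])  # linear experts
    return gate @ experts

rng = np.random.default_rng(0)
N, d, d_k = 5, 4, 3  # illustrative sizes only
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))

full = self_attention(X, W_q, W_k, W_v)
rows = np.stack([attention_row_as_moe(X, W_q, W_k, W_v, i) for i in range(N)])
print(np.allclose(full, rows))
```

The two computations agree because the softmax row of the attention matrix plays the role of the gate, while the value projection supplies one linear expert per token.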
Taxonomy
Topics: Forecasting Techniques and Applications
Methods: Attention Is All You Need · Softmax · Mixture of Experts
