Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention

Pedram Akbarian; Huy Nguyen; Xing Han; Nhat Ho

arXiv:2410.11222·stat.ML·July 10, 2025

TL;DR

This paper reveals a fundamental link between mixture of experts and self-attention, providing theoretical insights into gating functions and proposing an active-attention mechanism that improves performance across multiple tasks.

Contribution

It establishes a rigorous relation between MoE and self-attention, analyzes quadratic gating functions, and introduces active-attention with empirical validation.

Findings

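01 The quadratic monomial gate yields better sample efficiency than the quadratic polynomial gate when estimating parameters and experts.

02 Non-linear experts lead to faster parameter and expert estimation rates.

03 Active-attention outperforms standard self-attention in the paper's experiments.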

Abstract

Mixture of Experts (MoE) models are well known for effectively scaling model capacity while preserving computational overheads. In this paper, we establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. Motivated by this connection, we conduct a comprehensive convergence analysis of MoE models with two different quadratic gating functions, namely the quadratic polynomial gate and the quadratic monomial gate, offering useful insights into the design of gating and experts for the MoE framework. First, our analysis indicates that the use of the quadratic monomial gate yields an improved sample efficiency for estimating parameters and experts compared to the quadratic polynomial gate. Second, parameter and expert estimation rates become significantly faster…
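The claim that each row of a self-attention matrix is a quadratic gating mixture of linear experts can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's exact construction: it assumes single-head scaled dot-product attention, treats the concatenation of all tokens as the MoE input `x`, and uses hypothetical helper names (`attention_row`, `moe_row`, `select`). Under these assumptions the gate score of expert `j` is a quadratic form in `x`, and expert `j` is a linear map of `x`.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_row(X, Wq, Wk, Wv, i):
    """Row i of softmax(Q K^T / sqrt(d_k)) V, computed the usual way."""
    d_k = Wq.shape[1]
    scores = (X[i] @ Wq) @ (X @ Wk).T / np.sqrt(d_k)
    return softmax(scores) @ (X @ Wv)

def moe_row(X, Wq, Wk, Wv, i):
    """The same row written as a mixture of N linear experts with quadratic gating."""
    N, d = X.shape
    d_k = Wq.shape[1]
    x = X.reshape(-1)                    # concatenated tokens, shape (N*d,)
    M = Wq @ Wk.T / np.sqrt(d_k)         # (d, d) bilinear form of the attention scores

    def select(j):
        # Hypothetical helper: E_j extracts token j from x, so E_j @ x == X[j].
        E = np.zeros((d, N * d))
        E[:, j * d:(j + 1) * d] = np.eye(d)
        return E

    E_i = select(i)
    # Gate score of expert j is quadratic in x: x^T (E_i^T M E_j) x.
    scores = np.array([x @ (E_i.T @ M @ select(j)) @ x for j in range(N)])
    gates = softmax(scores)
    # Expert j is linear in x: (W_v^T E_j) x.
    experts = np.stack([Wv.T @ (select(j) @ x) for j in range(N)])
    return gates @ experts

rng = np.random.default_rng(0)
N, d, d_k, d_v = 4, 3, 2, 5              # arbitrary small sizes for the check
X = rng.normal(size=(N, d))
Wq, Wk = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d_v))
print(all(np.allclose(attention_row(X, Wq, Wk, Wv, i),
                      moe_row(X, Wq, Wk, Wv, i)) for i in range(N)))
```

The selection matrices are built explicitly only to make the quadratic and linear dependence on `x` visible; a practical implementation would index tokens directly.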

Taxonomy

Topics: Forecasting Techniques and Applications

Methods: Attention Is All You Need · Softmax · Mixture of Experts