A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts
Viet Nguyen, Tuan Minh Pham, Thinh Cao, Tan Dinh, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

TL;DR
This paper provides a theoretical framework showing that gated attention in Transformers can be viewed as a hierarchical mixture of experts, which explains its sample efficiency and guides where gates should be placed within the architecture.
Contribution
The paper gives a rigorous theoretical analysis of gated attention, showing that each entry of a gated attention matrix can be written as a hierarchical mixture of experts and establishing its sample-efficiency advantage over standard multi-head self-attention.
Findings
Gated attention entries can be expressed as hierarchical mixtures of experts.
Gated attention is more sample-efficient than standard multi-head self-attention.
Optimal placement of gates enhances model performance.
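As a rough illustration of the first finding, the single-head softmax attention output can already be read entrywise as a mixture: the softmax weights act as a gating function over tokens, and the value entries act as the experts. The notation below (q_i, k_m, v_{mj}, d, n) is assumed for this sketch and is not taken from the paper; the paper's hierarchical form, which also covers the gate and multiple heads, is not reproduced here.

```latex
\[
\Big[\operatorname{softmax}\!\Big(\tfrac{QK^{\top}}{\sqrt{d}}\Big)V\Big]_{ij}
  \;=\; \sum_{m=1}^{n}
  \underbrace{\frac{\exp\!\big(q_i^{\top}k_m/\sqrt{d}\big)}
                   {\sum_{l=1}^{n}\exp\!\big(q_i^{\top}k_l/\sqrt{d}\big)}}_{\text{mixing weight for token } m}
  \;\underbrace{v_{mj}}_{\text{expert}}
\]
```

The weights are nonnegative and sum to one over m, which is what makes the mixture-of-experts reading possible.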
Abstract
Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within the multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention's benefits remains lacking in the literature. To close this gap, we rigorously show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated…
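To make the object of study concrete, here is a minimal single-head sketch of gated attention using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the elementwise sigmoid gate applied to the attention output, and the names `gated_attention`, `w_g`, and `rand_matrix`, are hypothetical choices made for this example.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(x, w_q, w_k):
    """Row-stochastic matrix softmax(Q K^T / sqrt(d)) for one head."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        for q_row in q
    ]
    return [softmax(r) for r in scores]


def gated_attention(x, w_q, w_k, w_v, w_g):
    """One attention head with an elementwise sigmoid gate on its output.

    Assumption: the gate is computed from the input tokens and applied
    after the softmax-weighted sum. Other placements are possible.
    """
    a = attention_weights(x, w_q, w_k)
    out = matmul(a, matmul(x, w_v))
    gate = [[1.0 / (1.0 + math.exp(-g)) for g in row] for row in matmul(x, w_g)]
    return [[g * o for g, o in zip(g_row, o_row)] for g_row, o_row in zip(gate, out)]


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian random matrix as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d_model, d_head = 4, 6, 3
x = rand_matrix(n_tokens, d_model, rng)
w_q, w_k, w_v, w_g = (rand_matrix(d_model, d_head, rng) for _ in range(4))

weights = attention_weights(x, w_q, w_k)  # shape: n_tokens x n_tokens
y = gated_attention(x, w_q, w_k, w_v, w_g)  # shape: n_tokens x d_head
```

Because the sigmoid gate lies strictly between 0 and 1, each gated output entry is a rescaled copy of the corresponding ungated entry, so the gate can suppress a head's contribution for a given token without changing the softmax weights themselves.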
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection · Stochastic Gradient Optimization Techniques
