Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
Fanqi Yan, Huy Nguyen, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo

TL;DR
This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than softmax self-attention by modeling the attention mechanism as a mixture of experts, highlighting advantages in computational efficiency and feature focus.
Contribution
It provides a rigorous theoretical comparison showing that sigmoid self-attention requires less data than softmax self-attention to achieve similar approximation accuracy.
Findings
Sigmoid self-attention has lower sample complexity than softmax.
Representing self-attention as a mixture of experts reveals efficiency differences.
Sigmoid eliminates token competition, improving focus and efficiency.
Abstract
At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and it inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the…
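The contrast between row-wise softmax weights and element-wise sigmoid weights described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the single-head setup without learned projections, and the random inputs are all assumptions made for illustration only.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head attention with row-wise softmax weights (illustrative)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    # Row-wise normalization: each row of weights sums to 1.
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def sigmoid_attention(Q, K, V):
    """Single-head attention with element-wise sigmoid weights (illustrative)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Each weight depends only on its own score; there is no normalization.
    weights = 1.0 / (1.0 + np.exp(-scores))
    return weights @ V, weights

# Hypothetical toy inputs: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

_, w_soft = softmax_attention(Q, K, V)
_, w_sig = sigmoid_attention(Q, K, V)

# Perturb only the key of token 0 and recompute both sets of weights.
K_changed = K.copy()
K_changed[0] *= 3.0
_, w_soft_changed = softmax_attention(Q, K_changed, V)
_, w_sig_changed = sigmoid_attention(Q, K_changed, V)

# Softmax: weights on tokens 1..n-1 shift because every row must still sum to 1.
soft_others_moved = not np.allclose(w_soft_changed[:, 1:], w_soft[:, 1:])
# Sigmoid: weights on tokens 1..n-1 are unaffected by the change to token 0.
sig_others_fixed = np.allclose(w_sig_changed[:, 1:], w_sig[:, 1:])
print("softmax weights on other tokens changed:", soft_others_moved)
print("sigmoid weights on other tokens unchanged:", sig_others_fixed)
```

Changing one token's key moves the softmax weights of every other token, which is the competition the abstract refers to, while the sigmoid weights of the other tokens stay fixed. The sketch only illustrates this mechanism; it says nothing about the sample-complexity results themselves.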
Taxonomy
Topics: Mental Health Research Topics
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Focus · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings
