Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective

Fanqi Yan; Huy Nguyen; Pedram Akbarian; Nhat Ho; Alessandro Rinaldo

arXiv:2502.00281 · cs.LG · May 27, 2025

TL;DR

This paper shows theoretically that sigmoid self-attention is more sample-efficient than softmax self-attention, using a model of the attention mechanism as a mixture of experts. It also points to sigmoid's advantages in computational cost and in keeping attention from narrowing to a few features.

Contribution

The paper provides a rigorous theoretical comparison showing that sigmoid self-attention needs less data than softmax self-attention to reach similar approximation accuracy.

Findings

01. Sigmoid self-attention has lower sample complexity than softmax.

02. Representing self-attention as a mixture of experts reveals efficiency differences.

03. Sigmoid eliminates token competition, improving focus and efficiency.
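
One way to read the mixture-of-experts view in finding 02 is sketched below. The notation is an assumption made for illustration and does not come from this page: x_1, ..., x_N are the input tokens, W_Q, W_K, W_V are the query, key, and value projections, and d is the key dimension. Under that reading, the attention weights play the role of gates and the value projections play the role of experts; the paper's exact formulation may differ.

```latex
% Assumed notation: tokens x_1,...,x_N; projections W_Q, W_K, W_V; key dimension d.
\[
  s_{ij} = \frac{x_i^\top W_Q W_K^\top x_j}{\sqrt{d}}
\]
% Softmax self-attention: gates are normalized across tokens j, so they compete.
\[
  y_i^{\mathrm{softmax}}
  = \sum_{j=1}^{N}
    \underbrace{\frac{\exp(s_{ij})}{\sum_{k=1}^{N} \exp(s_{ik})}}_{\text{gate}}
    \,\underbrace{W_V^\top x_j}_{\text{expert}}
\]
% Sigmoid self-attention: each gate depends only on its own score s_{ij}.
\[
  y_i^{\mathrm{sigmoid}}
  = \sum_{j=1}^{N}
    \underbrace{\frac{1}{1 + \exp(-s_{ij})}}_{\text{gate}}
    \,\underbrace{W_V^\top x_j}_{\text{expert}}
\]
```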

Abstract

At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and it inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the…
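
The token competition described in the abstract can be illustrated with a minimal NumPy sketch. The score values and the helper names `softmax_weights` and `sigmoid_weights` are hypothetical and chosen only for illustration; this is not the paper's implementation. It raises one token's score and checks whether the weights of the other tokens move.

```python
import numpy as np


def softmax_weights(scores):
    # Row-wise softmax: every row is normalized to sum to 1,
    # so each weight depends on all scores in its row.
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sigmoid_weights(scores):
    # Element-wise sigmoid: each weight depends only on its own score.
    return 1.0 / (1.0 + np.exp(-scores))


# Hypothetical attention scores for one query over three tokens.
scores = np.array([[1.0, 0.5, -0.5]])

# Raise only the first token's score.
boosted = scores.copy()
boosted[0, 0] += 2.0

softmax_before = softmax_weights(scores)
softmax_after = softmax_weights(boosted)
sigmoid_before = sigmoid_weights(scores)
sigmoid_after = sigmoid_weights(boosted)

# Did the weights of the *other* tokens change?
softmax_others_changed = not np.allclose(softmax_before[0, 1:], softmax_after[0, 1:])
sigmoid_others_changed = not np.allclose(sigmoid_before[0, 1:], sigmoid_after[0, 1:])

print("softmax: other tokens' weights changed ->", softmax_others_changed)
print("sigmoid: other tokens' weights changed ->", sigmoid_others_changed)
```

Under softmax, the normalization over the row forces the other tokens' weights down when one score rises. Under the element-wise sigmoid, they stay where they were, and the weights in a row no longer need to sum to 1.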

Taxonomy

Topics: Mental Health Research Topics

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Focus · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings