Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Jason Ramapuram; Federico Danieli; Eeshan Dhekane; Floris Weers; Dan Busbridge; Pierre Ablin; Tatiana Likhomanenko; Jagrit Digani; Zijin Gu; Amitis Shidani; Russ Webb

arXiv:2409.04431·cs.LG·January 23, 2025

TL;DR

This paper provides a comprehensive theoretical and empirical analysis of sigmoid attention in transformers, demonstrating its universality, stability, and practical efficiency, and establishing best practices for its use as a softmax alternative.

Contribution

It proves sigmoid attention's universality and improved regularity, introduces a hardware-efficient implementation, and offers best practices for its integration in transformers.

Findings

01 Sigmoid attention is a universal function approximator.

02 Proper normalization stabilizes training and improves performance.

03 The proposed implementation speeds up inference by 17% on H100 GPUs.

Abstract

Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also…
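The mapping the abstract describes can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-lists layout, the 1/sqrt(d) scaling, and the default bias of -log(n) for the sigmoid variant are all assumptions made for this example. It shows the one structural difference between the two variants: softmax normalizes each row of weights to sum to 1, while sigmoid scores each key independently.

```python
import math


def _dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def _softmax(scores):
    """Row-wise softmax; subtracting the max keeps exp() from overflowing."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_weights(Q, K, kind="softmax", bias=None):
    """Return one row of weights per query (hypothetical helper).

    kind="softmax": weights in each row sum to 1.
    kind="sigmoid": each weight is computed independently of the others.
    bias: scalar added to the sigmoid scores; the default -log(n) is an
          assumption chosen so that the weights start small.
    """
    n, d = len(K), len(Q[0])
    if bias is None:
        bias = -math.log(n)
    rows = []
    for q in Q:
        scores = [_dot(q, k) / math.sqrt(d) for k in K]
        if kind == "softmax":
            rows.append(_softmax(scores))
        else:
            rows.append([_sigmoid(s + bias) for s in scores])
    return rows


def attention(Q, K, V, kind="softmax", bias=None):
    """Each output element is a weighted sum of the value vectors."""
    W = attention_weights(Q, K, kind=kind, bias=bias)
    dv = len(V[0])
    return [[sum(w * v[j] for w, v in zip(row, V)) for j in range(dv)]
            for row in W]
```

With all-zero queries every score is 0, so softmax gives each of the n keys weight 1/n, while the sigmoid variant with bias -log(n) gives each key weight 1/(1+n); the sigmoid rows therefore do not sum to 1.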

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

The paper is a very detailed analysis of sigmoid attention vs. softmax attention. Sigmoid attention is not very original, but the authors did a solid job of exploring its theoretical foundations and executing a solid ablation study to decide whether sigmoid attention is a viable replacement for softmax attention. The first theoretical contribution, the proof that sigmoid attention can be used as a universal approximator of continuous functions, is executed well, but has limited novelty, and not very in

Weaknesses

The main weakness of the paper is that it doesn't answer the main question, "Why should we switch from the original softmax attention to sigmoid attention?": - Will sigmoid attention be more stable than the original softmax attention? - Will it help with long context? On the positive side, thanks to FlashSigmoid we can see some speed-up for LLM inference (Table 3). More details: The paper started with the explanation that the original "softmax in SoftmaxAttn is not without limitations. For instance, the softma

Reviewer 02 · Rating 6 · Confidence 4

Strengths

**Originality:** This paper provides an in-depth mathematical analysis of sigmoid attention, especially of its Universal Approximation Property and regularity. The authors also identify stabilization of large initial attention norms during the early stages of training as a crucial factor for sigmoid attention. **Quality:** This paper provides detailed mathematical proofs and very extensive experiments and ablation studies on sigmoid attention including supervised image classification, self-supervised image

Weaknesses

1. In Sec 3.2, the authors state that the Lipschitz constant provides insight into the robustness of the network and the ease of optimizing it. They then present a theorem stating that, in $\mathbb{R}^2$, the local Lipschitz constant of SigmoidAttn is much lower than the worst local Lipschitz constant of SoftmaxAttn. This suggests that sigmoid attention should be easier to train than softmax. However, this is inconsistent with the experiments presented by the authors in Fig. 2, Fig. 3, and Fig.

Reviewer 03 · Rating 5 · Confidence 5

Strengths

**Originality:** The paper gives a good in-depth analysis of the use of the sigmoid function as an activation for attention and provides an original perspective on how it should be used. I applaud the authors for taking the time to give such a detailed analysis. **Quality:** The quality of the paper is good, clearly showing that the authors have taken the time to give the reader a clear understanding of sigmoid-based attention mechanisms in transformer architectures. Furthermore, they undertake

Weaknesses

**Novelty:** The main issue I have with the paper is its novelty. Although the authors give an in-depth analysis of sigmoid attention, I don't see it adding value to the community, as their results often show that it performs only comparably to softmax. The main novelty of the paper, I feel, is that they develop FlashSigmoid, but I feel this is more of an engineering feat and is not enough for a paper to be accepted into ICLR. I applaud the authors for their in-depth analysis and their various ablations

Code & Models

Repositories

apple/ml-sigmoid-attention
pytorch · Official

Taxonomy

Topics · Identity, Memory, and Therapy

Methods · Attention Is All You Need · Softmax