A Study on ReLU and Softmax in Transformer
Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian

TL;DR
This paper studies ReLU and Softmax as activation functions in Transformer FFNs and key-value memories, shows that the two become equivalent once a layer normalization module is added on Softmax, and introduces ReLUFormer, a fully ReLU-based architecture that improves long sequence processing.
Contribution
It reconnects the FFN (which uses ReLU) with key-value memory (which uses Softmax) by showing they are equivalent when layer normalization is added on Softmax, and proposes ReLUFormer, a full ReLU architecture for long sequence tasks.
Findings
ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
ReLU and Softmax are equivalent when an additional layer normalization module is applied on Softmax.
ReLUFormer improves long sequence task performance.
Abstract
The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories according to previous works. However, FFN and traditional memory utilize different activation functions (i.e., ReLU and Softmax respectively), which makes them not equivalent. In this paper, we first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax, and find they are equivalent when adding an additional layer normalization module on Softmax. In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large. We analyze the reasons and then explore this good property of ReLU on the self-attention network where the original Softmax activation performs poorly on long input sequences. We then propose a full ReLU architecture named ReLUFormer which performs…
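The key-value memory view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy dimensions, and in particular the placement of layer normalization on the memory output are assumptions made for illustration, since the abstract only states that layer normalization is added on Softmax.

```python
import math

def scores(x, keys):
    # Dot product of the input x with every key slot.
    return [sum(xi * ki for xi, ki in zip(x, k)) for k in keys]

def relu(s):
    # Unnormalized, sparse weights: negative scores are zeroed.
    return [max(0.0, v) for v in s]

def softmax(s):
    # Normalized weights that sum to 1 (shifted by the max for stability).
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def read(weights, values):
    # Weighted sum over value slots.
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def layer_norm(h, eps=1e-5):
    # Plain layer normalization without learned scale or bias.
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]

def ffn_relu(x, keys, values):
    # FFN viewed as a key-value memory with ReLU activation.
    return read(relu(scores(x, keys)), values)

def memory_softmax(x, keys, values, use_layer_norm=False):
    # Key-value memory with Softmax activation.
    # Assumption: layer normalization is applied to the memory output.
    out = read(softmax(scores(x, keys)), values)
    return layer_norm(out) if use_layer_norm else out

# Toy example: 3 key-value slots, 2-dimensional input and values.
x = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(ffn_relu(x, keys, values))
print(memory_softmax(x, keys, values))
print(memory_softmax(x, keys, values, use_layer_norm=True))
```

The two variants differ only in how the key scores are turned into weights: Softmax forces the weights to sum to 1 across all slots, while ReLU leaves them unnormalized and can zero out slots entirely.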
Taxonomy
Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Machine Learning and ELM
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Label Smoothing
