Self-Adjust Softmax
Chuanyang Zheng, Yihang Gao, Guoxuan Chen, Han Shi, Jing Xiong, Xiaozhe Ren, Chao Huang, Xin Jiang, Zhenguo Li, Yu Li

TL;DR
This paper introduces Self-Adjust Softmax (SA-Softmax), a modified softmax function designed to improve gradient flow in Transformer attention, with theoretical benefits and empirical validation across large-scale models and diverse tasks.
Contribution
The paper proposes SA-Softmax, a novel softmax variant that enhances gradient properties and can be easily integrated into Transformer models.
Findings
SA-Softmax improves gradient flow in attention mechanisms.
Models with SA-Softmax outperform those with vanilla softmax.
Empirical results on large models show consistent performance gains.
Abstract
The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one, and it achieves superior performance over alternative functions. However, the softmax function can suffer from vanishing gradients when some elements of the attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying softmax(x) to x · softmax(x) and its normalized variant (x − min(x_min, 0)) / (max(0, x_max) − min(x_min, 0)) · softmax(x). We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax Attention can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted…
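The gradient claim in the abstract can be illustrated with a small numerical sketch. It assumes the SA-Softmax form x · softmax(x) and the normalized variant (x − min(x_min, 0)) / (max(0, x_max) − min(x_min, 0)) · softmax(x); the function names, the `eps` guard against a zero denominator, and the finite-difference helper are illustrative choices, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sa_softmax(xs):
    """Assumed SA-Softmax form: x * softmax(x), elementwise."""
    return [x * p for x, p in zip(xs, softmax(xs))]

def sa_softmax_normalized(xs, eps=1e-12):
    """Assumed normalized variant:
    (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x).
    eps is a hypothetical guard for the all-zero case, not from the paper."""
    lo = min(min(xs), 0.0)
    hi = max(max(xs), 0.0)
    scale = hi - lo + eps
    return [(x - lo) / scale * p for x, p in zip(xs, softmax(xs))]

def diag_grad(fn, xs, i, h=1e-5):
    """Central finite difference of output i with respect to input i."""
    up = list(xs)
    down = list(xs)
    up[i] += h
    down[i] -= h
    return (fn(up)[i] - fn(down)[i]) / (2 * h)

# One score dominates, so its softmax probability is close to one.
scores = [10.0, 0.0, 0.0]
grad_softmax = diag_grad(softmax, scores, 0)   # p * (1 - p), nearly zero
grad_sa = diag_grad(sa_softmax, scores, 0)     # p + x * p * (1 - p), near one
print(f"vanilla softmax: {grad_softmax:.6f}")
print(f"x * softmax(x):  {grad_sa:.6f}")
```

For the saturated score, the vanilla softmax derivative p(1 − p) collapses toward zero, while the derivative of x · softmax(x) keeps the additive term p and so stays close to one. This is only a single-row illustration of the mechanism, not a reproduction of the paper's analysis or experiments.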
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Big Data and Digital Economy
Methods: Attention Is All You Need · Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
