Self-Adjust Softmax

Chuanyang Zheng; Yihang Gao; Guoxuan Chen; Han Shi; Jing Xiong; Xiaozhe Ren; Chao Huang; Xin Jiang; Zhenguo Li; Yu Li

arXiv:2502.18277·cs.CL·February 26, 2025

Open Access

TL;DR

This paper introduces Self-Adjust Softmax (SA-Softmax), a modified softmax function designed to improve gradient flow in Transformer attention, with theoretical benefits and empirical validation across large-scale models and diverse tasks.

Contribution

The paper proposes SA-Softmax, a novel softmax variant that enhances gradient properties and can be easily integrated into Transformer models.

Findings

01 SA-Softmax improves gradient flow in attention mechanisms.

02 Models with SA-Softmax outperform those with vanilla softmax.

03 Empirical results on large models show consistent performance gains.
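The gradient-flow claim can be illustrated with the standard softmax Jacobian. The block below is a sketch in our own notation ($p$, $y$, $\delta_{ij}$ are not taken from the paper) and is not the paper's derivation; it only shows, under the definition $y = x \cdot \mathrm{softmax}(x)$, where an extra gradient term comes from.

```latex
% Assumed notation: p = softmax(x), y_i = x_i p_i, \delta_{ij} is the Kronecker delta.
\begin{align}
  \frac{\partial p_i}{\partial x_j} &= p_i\,(\delta_{ij} - p_j), \\
  \frac{\partial y_i}{\partial x_j} &= \delta_{ij}\,p_i + x_i\,p_i\,(\delta_{ij} - p_j).
\end{align}
```

When one probability approaches one and the rest approach zero, every entry of the first Jacobian tends to zero. In the second, the additional diagonal term $\delta_{ij}\,p_i$ does not vanish for the dominant entry, which is one way to read the statement that the modified function has better gradient properties.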

Abstract

The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one and achieves superior performance over alternative functions. However, softmax can suffer from vanishing gradients when some elements of the attention scores approach extreme values, so that the resulting probabilities are close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax Attention can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted…
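The two formulas in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the row-wise `axis` convention, and the small `eps` guard against a zero denominator (which occurs when every score in a row is zero) are assumptions added here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable vanilla softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def sa_softmax(x, axis=-1):
    """x * softmax(x), the unnormalized form given in the abstract."""
    return x * softmax(x, axis=axis)


def sa_softmax_normalized(x, axis=-1, eps=1e-12):
    """(x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x).

    `eps` is an assumption of this sketch, added to avoid division by zero
    when all scores in a row are zero; the abstract does not mention it.
    """
    x_min = np.minimum(np.min(x, axis=axis, keepdims=True), 0.0)
    x_max = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    scale = (x - x_min) / (x_max - x_min + eps)
    return scale * softmax(x, axis=axis)


# Hypothetical attention scores for a single query over three keys.
scores = np.array([[2.0, 0.0, -2.0]])
probs = softmax(scores)
sa = sa_softmax(scores)
sa_norm = sa_softmax_normalized(scores)
```

In the normalized variant the scaling factor lies in [0, 1] (up to the `eps` guard), so each output is bounded by the corresponding softmax probability. Note that, unlike vanilla softmax, neither variant produces rows that sum to one in general.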

Code & Models

Models

🤗 Gausson/gpt-neox-125m-deduped-SA (model)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Big Data and Digital Economy

Methods: Attention Is All You Need · Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer