Scalable-Softmax Is Superior for Attention
Ken M. Nakanishi

TL;DR
The paper introduces Scalable-Softmax (SSMax), a replacement for Softmax in attention that keeps attention focused on key information as the context grows, improving language model performance in long contexts and length generalization.
Contribution
It proposes SSMax as a replacement for Softmax in attention, enhancing long-context modeling and key information prioritization in Transformer-based models.
Findings
SSMax improves long-context attention in language models.
Models with SSMax achieve faster loss reduction during pretraining.
SSMax enhances length generalization even when introduced after pretraining.
Abstract
The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax…
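The flattening effect described in the abstract can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it places one "key" logit above n-1 identical background logits and compares the weight that key receives under Softmax and under an SSMax-style function. The form used here, logits multiplied by s·log(n) before Softmax, is our reading of the SSMax definition and is not spelled out in this summary; the function names, the fixed `s=1.0`, and the `gap` value are assumptions chosen for illustration.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of floats.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def ssmax(z, s=1.0):
    # Assumed SSMax form: softmax over logits scaled by s * log(n),
    # where n is the input length. s is fixed here for illustration.
    n = len(z)
    return softmax([s * math.log(n) * x for x in z])

def peak_weight(fn, n, gap=2.0):
    # Weight given to a single "key" logit of value `gap`
    # when the other n - 1 logits are all 0.
    z = [gap] + [0.0] * (n - 1)
    return fn(z)[0]

if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, peak_weight(softmax, n), peak_weight(ssmax, n))
```

Under these assumptions, the Softmax weight on the key element shrinks toward zero as n grows, while the length-dependent scaling keeps it close to one.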
Code & Models
- 🤗 mistralai/Devstral-Small-2-24B-Instruct-2512 · model · 165k dl · ♡ 570
- 🤗 unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF · model · 36k dl · ♡ 122
- 🤗 ExaltedSlayer/mistralai-devstral-small-2-24b-instruct-2512-mlx-mxfp4 · model · 359 dl · ♡ 2
- 🤗 unsloth/Devstral-Small-2-24B-Instruct-2512 · model · 3.4k dl · ♡ 7
- 🤗 cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit · model · 5.8k dl · ♡ 11
- 🤗 AlexanderKyng/Devstral-Small-2-24B-Instruct-2512-exl3-4.5bpw-optimized · model · 8 dl · ♡ 2
- 🤗 akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16 · model · 7.5k dl · ♡ 1
- 🤗 androiddrew/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit · model · 461 dl
- 🤗 professorf/Devstral-Small-2-24B-Instruct-2512-gguf · model · 44 dl · ♡ 1
- 🤗 coolroman/affine-1-5FHPjm5fA4AGPGuYdE3jVg7u2Av5KP23G8ELDaNgF8MiNB2p · model · 7 dl
Taxonomy
Topics: Neural Networks and Applications · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Softmax · Focus
