Scalable-Softmax Is Superior for Attention

Ken M. Nakanishi

arXiv:2501.19399·cs.CL·February 3, 2025

TL;DR

The paper introduces Scalable-Softmax (SSMax), a replacement for Softmax in attention that keeps the model focused on key information as the context grows, improving language model performance in long contexts and length generalization.

Contribution

The paper proposes SSMax as a replacement for Softmax in attention, improving long-context modeling and the prioritization of key information in Transformer-based models.

Findings

01 SSMax improves long-context attention in language models.

02 Models with SSMax achieve faster loss reduction during pretraining.

03 SSMax enhances length generalization even when introduced after pretraining.

Abstract

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax…
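The flattening effect described in the abstract can be checked numerically. The sketch below is illustrative only and is not the author's code. It assumes SSMax takes the form of Softmax applied to logits multiplied by s · log(n), where n is the input vector size; the scalar s is fixed at a hypothetical value of 1.0 here, and the helper `peak_weight` and its `key_logit` parameter are invented for this example.

```python
import math

def softmax(z):
    # Numerically stable Softmax over a list of floats.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def ssmax(z, s=1.0):
    # ASSUMPTION: SSMax is modeled as Softmax over logits scaled by
    # s * log(n), with n the input size. s = 1.0 is a hypothetical value.
    n = len(z)
    scale = s * math.log(n)
    return softmax([scale * x for x in z])

def peak_weight(fn, n, key_logit=5.0):
    # Hypothetical helper: one "key" entry with logit key_logit and
    # n - 1 distractors with logit 0. Returns the weight on the key entry.
    z = [key_logit] + [0.0] * (n - 1)
    return fn(z)[0]

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, peak_weight(softmax, n), peak_weight(ssmax, n))
```

With these toy logits, the Softmax weight on the key entry falls below 1% by n = 100,000, while the scaled variant keeps it close to 1 because the scaling factor grows with log(n).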

Code & Models

Repositories

gdevos010/Scalable-Softmax (PyTorch)

Taxonomy

Topics: Neural Networks and Applications · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Softmax · Focus