Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
Bo Gao, Michael W. Spratling, Letizia Gionfrida

TL;DR
This paper introduces a two-stage attention mechanism replacing Softmax with Softplus and re-weighting to improve numerical stability and length extrapolation in large language models, enabling better long-context understanding and physical modeling.
Contribution
It proposes a novel two-stage attention design with Softplus normalization and re-weighting, significantly enhancing length extrapolation and stability over traditional Softmax attention.
Findings
Outperforms Softmax and Softmax-free attention methods in length extrapolation
Maintains low validation loss at 16x training length
Enables models to recover physical laws from data
Abstract
Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $\ell_1$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention, and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening) which consists of a re-weighting mechanism that amplifies significant attentional…
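The first stage described in the abstract can be illustrated with a minimal NumPy sketch. It shows only the general idea: Softmax is a positive transformation (exp) followed by $\ell_1$-normalisation, and the Softplus variant swaps exp for Softplus. The function names here are hypothetical, the usual $1/\sqrt{d}$ scaling is used as a stand-in, and the paper's dynamic scale factor, the re-weighting (sharpening) stage, and causal masking are deliberately left out because this summary does not specify them. It should not be read as the authors' implementation.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def l1_normalise(a, axis=-1, eps=1e-12):
    # Divide by the l1 norm so that each row sums to (approximately) one.
    return a / (np.sum(np.abs(a), axis=axis, keepdims=True) + eps)

def softmax_attention_weights(scores):
    # Softmax written as exp followed by l1-normalisation.
    # Subtracting the row maximum avoids overflow in exp.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    return l1_normalise(np.exp(shifted))

def softplus_attention_weights(scores):
    # Same two-step structure, with Softplus as the positive transformation.
    return l1_normalise(softplus(scores))

def attention(q, k, v, weight_fn):
    # Assumption: plain 1/sqrt(d) scaling, NOT the paper's dynamic scale factor.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return weight_fn(scores) @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))   # 4 query tokens, head dimension 8
    k = rng.normal(size=(6, 8))   # 6 key tokens
    v = rng.normal(size=(6, 3))   # 6 value vectors of dimension 3
    out = attention(q, k, v, softplus_attention_weights)
    print(out.shape)
```

Because Softplus grows linearly rather than exponentially for large inputs, it needs no max-subtraction trick to stay finite, which is one plausible reading of the "more numerically stable" claim in the abstract.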
Peer Reviews
Decision · Submitted to ICLR 2026
1. The decomposition of the Softmax operation into a non-linear positive transformation and l1-norm is conceptually direct and simple to understand. The re-interpretation helps unify multiple Softmax-free attention variants under a coherent framework. 2. Supported by both quantitative results and visualization analyses, the proposed method shows stronger abilities in improving length extrapolation and reducing attention sink, suggesting that the two-stage normalization and re-weighting design offers
1. While the proposed method (LSSA/LSSAR) is conceptually solid and supported by several illustrative experiments, evaluation across a broader set of backbones is necessary to demonstrate robustness and effectiveness beyond a single model family. More baselines should be added to long-context tasks and downstream evaluation. 2. Since efficiency is central to practical deployment, the measurements of runtime or memory profiling relative to optimized kernels are critical for completeness.
1. The paper presents a clear motivation, addressing the length extrapolation limitations of Softmax attention. The authors demonstrate a solid understanding of the key factors affecting extrapolation, such as maintaining entropy invariance, and mitigating attention over-smoothing and distraction. 2. The proposed method, LSSAR, is conceptually simple and easy to implement, while being supported by a reasonable set of experiments. 3. The proposed method exhibits strong extrapolation performance
1. **Baseline implementation details are insufficient.** In current mainstream LLM implementations (e.g., Qwen3, Gemma3, etc.), Softmax attention typically incorporates QK-Norm, and employs NTK/Yarn or length-scaling factors ($\alpha\log{L}$) in out-of-context settings—techniques that have been widely validated to enhance extrapolation. It is unclear whether these enhancements were applied to the Softmax baselines in this paper, while the proposed LSSAR explicitly includes both QK-Norm and two l
This paper is clearly written. The authors first decompose Softmax into two steps, then argue that the second step (normalization) is more important. The paper focuses on discussing the first step and improves the activation function (exp), resulting in LSSA and its re-weighting version LSSAR.
1. The paper evaluates models of limited scale. The authors only discuss a single model size, but should examine larger models. Based on my experience, a token consumption of approximately 10 billion should be feasible on the authors' GPU infrastructure. 2. The paper lacks discussion on efficient kernel implementations. I am not requesting the authors to implement Triton/CUDA-level kernels, but rather expect a discussion of the mathematical derivation of efficient algorithms. For example, algori
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax
