Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond
Yang Cao, Yingyu Liang, Zhenmei Shi, Zhao Song

TL;DR
This paper offers a theoretical analysis of softmax neural networks, revealing their optimization advantages and demonstrating their effectiveness in diffusion models, thereby enhancing understanding of their success in large language models and generative tasks.
Contribution
It provides a novel theoretical framework using NTK to explain softmax's advantages and applies these insights to score estimation in diffusion models, showing provable learning guarantees.
Findings
Softmax induces a good perturbation property in NTK matrices.
Softmax neural networks can learn target functions in over-parameterized regimes.
Gradient algorithms can accurately learn score functions in diffusion models.
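The last finding rests on treating score estimation as a regression problem. The sketch below illustrates that reduction on a toy case and is not the paper's construction. It assumes Gaussian data, a single fixed noise level with hypothetical values `alpha` and `sigma`, and a linear score model fit by least squares, which stands in here for a softmax network trained by gradient descent. For Gaussian data the true score of the noised distribution is known in closed form, so the fit can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 3
alpha, sigma = 0.8, 0.6  # hypothetical schedule values at one fixed time t

x0 = rng.normal(size=(n, d))   # toy data: x0 ~ N(0, I)
eps = rng.normal(size=(n, d))  # Gaussian noise
xt = alpha * x0 + sigma * eps  # noised sample x_t

# Denoising score matching target: grad_{x_t} log p(x_t | x0) = -eps / sigma.
# Each sample contributes a d-dimensional regression label.
target = -eps / sigma

# Least-squares fit of a linear score model s(x) = x @ B
# (assumption: a linear model replaces the softmax network).
B, *_ = np.linalg.lstsq(xt, target, rcond=None)

# Since x_t ~ N(0, (alpha^2 + sigma^2) I), the true score is
# -x / (alpha^2 + sigma^2), so B should be close to true_coeff * I.
true_coeff = -1.0 / (alpha**2 + sigma**2)
print("fitted B:\n", np.round(B, 3))
print("closed-form coefficient:", true_coeff)
```

The regression labels are noisy, yet the minimizer of the squared loss is the marginal score. That is why a guarantee for learning a multi-output regression target can carry over to score estimation.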
Abstract
The softmax activation function plays a crucial role in the success of large language models (LLMs), particularly in the self-attention mechanism of the widely adopted Transformer architecture. However, the underlying learning dynamics that contribute to the effectiveness of softmax remain largely unexplored. As a step towards better understanding, this paper provides a theoretical study of the optimization and generalization properties of two-layer softmax neural networks, offering insights into their superior performance over other activation functions, such as ReLU and exponential. Leveraging the Neural Tangent Kernel (NTK) framework, our analysis reveals that the normalization effect of the softmax function leads to a good perturbation property of the induced NTK matrix, resulting in a good convex region of the loss landscape. Consequently, softmax neural networks can…
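To make the "perturbation property of the induced NTK matrix" concrete, the following sketch builds an empirical NTK Gram matrix for a small two-layer softmax network and measures how much it moves when the hidden weights are perturbed. The parameterization f(x) = m · ⟨a, softmax(Wx)⟩ with fixed ±1 output weights `a`, the scalar output, and the sizes `n`, `d`, `m` are all assumptions made for illustration and may differ from the paper's exact setup.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def f(W, a, x):
    # Assumed model: f(x) = m * <a, softmax(W x)>, only W is trained.
    return W.shape[0] * (a @ softmax(W @ x))

def grad_W(W, a, x):
    # df/dz_r = m * s_r * (a_r - <a, s>) with z = W x, then df/dW[r, :] = df/dz_r * x.
    s = softmax(W @ x)
    dz = W.shape[0] * s * (a - a @ s)
    return np.outer(dz, x).ravel()

def ntk_gram(W, a, X):
    # Empirical NTK Gram matrix: H[i, j] = <grad_W f(x_i), grad_W f(x_j)>.
    G = np.stack([grad_W(W, a, x) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
n, d, m = 8, 5, 64
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
a = rng.choice([-1.0, 1.0], size=m)            # fixed output weights
W0 = rng.normal(size=(m, d))                   # Gaussian initialization

H0 = ntk_gram(W0, a, X)
for radius in (1e-3, 1e-2, 1e-1):
    W = W0 + radius * rng.normal(size=W0.shape)
    change = np.linalg.norm(ntk_gram(W, a, X) - H0) / np.linalg.norm(H0)
    print(f"radius={radius:g}  relative change in NTK Gram={change:.3e}")
```

If the Gram matrix stays close to its value at initialization while the weights move within a small ball, the training dynamics stay close to kernel regression with a fixed kernel. That is the mechanism NTK-style convergence arguments rely on.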
Peer Reviews
Decision·Submitted to ICLR 2026
1. Under the NTK framework, the authors establish the convergence rate for training softmax transformers using gradient descent. 2. By leveraging the connection with score matching in diffusion models and multi-label regression, the authors obtain the convergence rate of training score functions.
While the technical ideas make sense to me, I still have the following concerns: 1. Some literature shows that softmax transformers have better sequence-length dependence than ReLU networks. Do your results also support this point? 2. Are there any technical difficulties in applying NTK techniques to softmax activation compared with ReLU networks? This is not clearly demonstrated in the manuscript.
- The paper is theoretically grounded.
- The paper is unnecessarily over-complicated.
- In fact, LLMs can be the use case of the proposed work, but there seems to be no need to give a long explanation of LLMs. I recommend removing Sections 2.2 and 2.3 as well. Instead, the authors can explain some works, such as Munteanu et al. (2022), in the related work section, since it is more directly linked to the proposed work.
- In Table 1, I disagree with Line 57-59: We can see that ... For example, $n^2$ and $n^{2+o(1)}$ can be hugely dif
The paper studies properties of two-layer softmax neural networks. The topic is worth exploring, since softmax plays an important role in modern AI systems. The authors establish a convergence result in Theorem 4.2 using NTK-based tools. They also extend their theoretical results to diffusion models and establish convergence there.
It seems that the main contribution of the current paper is establishing convergence results for two-layer softmax neural networks and extending them to diffusion models. The analysis is a standard NTK analysis (though in Section 5.1 the authors explicitly explain the new challenges in their theoretical derivation). I also personally feel the paper would benefit greatly from more explanation, discussion, and comparisons. The current presentation is too notation-heavy, and lacks explanation of in
Taxonomy
Topics: Advanced Thermodynamics and Statistical Mechanics
Methods: Attention Is All You Need · Dense Connections · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Neural Tangent Kernel · Absolute Position Encodings
