Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond

Yang Cao; Yingyu Liang; Zhenmei Shi; Zhao Song

arXiv:2405.03251·cs.LG·January 27, 2026·2 cites

TL;DR

This paper offers a theoretical analysis of softmax neural networks, revealing their optimization advantages and demonstrating their effectiveness in diffusion models, thereby enhancing understanding of their success in large language models and generative tasks.

Contribution

It provides a novel theoretical framework using NTK to explain softmax's advantages and applies these insights to score estimation in diffusion models, showing provable learning guarantees.

Findings

01

Softmax induces a good perturbation property in NTK matrices.

02

Softmax neural networks can learn target functions in over-parameterized regimes.

03

Gradient algorithms can accurately learn score functions in diffusion models.
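The third finding can be illustrated with a toy regression view of score learning. The sketch below is not the paper's construction: it assumes standard denoising score matching, Gaussian data, and a hypothetical one-parameter linear score model `theta * x`, all chosen so the true score is known in closed form. The names `dsm_fit`, `alpha`, and `sigma` are illustrative only.

```python
import numpy as np

def dsm_fit(alpha=0.8, sigma=0.6, n=20000, d=2, lr=0.1, steps=200, seed=0):
    """Fit a linear score model s(x) = theta * x by gradient descent.

    Assumptions (not from the paper): x0 ~ N(0, I), the noised sample is
    x_t = alpha * x0 + sigma * eps, and the regression target is the
    denoising score matching target -eps / sigma.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=(n, d))
    eps = rng.normal(size=(n, d))
    xt = alpha * x0 + sigma * eps
    target = -eps / sigma

    theta = 0.0
    for _ in range(steps):
        resid = theta * xt - target
        # Gradient of the mean squared regression loss with respect to theta.
        grad = 2.0 * np.mean(np.sum(resid * xt, axis=1))
        theta -= lr * grad

    # For this Gaussian setup, x_t ~ N(0, (alpha^2 + sigma^2) I), so the
    # true score is -x / (alpha^2 + sigma^2).
    theta_true = -1.0 / (alpha**2 + sigma**2)
    return theta, theta_true

if __name__ == "__main__":
    theta_hat, theta_true = dsm_fit()
    print("learned coefficient:", theta_hat)
    print("true coefficient:   ", theta_true)
```

The learning rate is tuned to the default `d`, `alpha`, and `sigma`; other settings may need a smaller step size for the iteration to converge.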

Abstract

The softmax activation function plays a crucial role in the success of large language models (LLMs), particularly in the self-attention mechanism of the widely adopted Transformer architecture. However, the underlying learning dynamics that contribute to the effectiveness of softmax remain largely unexplored. As a step towards better understanding, this paper provides a theoretical study of the optimization and generalization properties of two-layer softmax neural networks, offering theoretical insights into their superior performance over other activation functions, such as ReLU and exponential. Leveraging the Neural Tangent Kernel (NTK) framework, our analysis reveals that the normalization effect of the softmax function leads to a good perturbation property of the induced NTK matrix, resulting in a good convex region of the loss landscape. Consequently, softmax neural networks can…
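To make the NTK matrix mentioned in the abstract concrete, here is a minimal NumPy sketch of an empirical NTK Gram matrix for a two-layer softmax network. The parameterization `f(x) = a · softmax(W x)` with a fixed output vector `a`, and the function names, are assumptions for illustration; the paper's exact definition and scaling may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(W, a, x):
    """Scalar output of the assumed two-layer network: a . softmax(W x)."""
    return a @ softmax(W @ x)

def grad_W(W, a, x):
    """Gradient of forward(W, a, x) with respect to W, shape (m, d)."""
    s = softmax(W @ x)
    f = a @ s
    # Chain rule through the softmax Jacobian: df/dz_k = s_k * (a_k - f).
    dz = s * (a - f)
    return np.outer(dz, x)

def empirical_ntk(W, a, X):
    """Gram matrix K[i, j] = <grad_W f(x_i), grad_W f(x_j)>."""
    J = np.stack([grad_W(W, a, x).ravel() for x in X])
    return J @ J.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, d, n = 64, 5, 6                        # hidden width, input dim, samples
    W = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)       # fixed output weights (assumption)
    X = rng.normal(size=(n, d))

    K = empirical_ntk(W, a, X)
    K_perturbed = empirical_ntk(W + 0.01 * rng.normal(size=(m, d)), a, X)
    # How far the kernel moves under a small weight perturbation.
    print("kernel change:", np.linalg.norm(K_perturbed - K))
    print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```

Comparing `K` with `K_perturbed` is only a numerical probe of how the kernel reacts to weight changes; it does not reproduce the paper's perturbation bounds.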

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Under the NTK framework, the authors establish the convergence rate for training softmax transformers using gradient descent.
2. By leveraging the connection with score matching in diffusion models and multi-label regression, the authors obtain the convergence rate of training score functions.

Weaknesses

While the technical ideas make sense to me, I still have the following concerns:
1. Some literature shows that softmax transformers have better sequence-length dependence than ReLU networks. Do your results also support this point?
2. Are there any technical difficulties in applying NTK techniques to softmax activation compared with ReLU networks? This is not clearly demonstrated in the manuscript.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper is theoretically grounded.

Weaknesses

- The paper is unnecessarily over-complicated.
- In fact, LLMs can be the use case of the proposed work, but there seems to be no need to give a long explanation of LLMs. I recommend removing Sections 2.2 and 2.3 as well. Instead, the authors can explain some works, such as Munteanu et al. (2022), in the related work section, since it is more directly linked to the proposed work.
- In Table 1, I disagree with Line 57-59: We can see that ... For example, $n^2$ and $n^{2+o(1)}$ can be hugely dif

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The paper studies the properties of two-layer softmax neural networks. The topic is worth exploring since softmax plays an important role in modern AI systems. The authors establish a convergence result in Theorem 4.2, based on NTK theoretical tools. They also extend their theoretical results to diffusion models and establish convergence there.

Weaknesses

It seems that the main contribution of the current paper is establishing convergence results for two-layer softmax neural networks and extending them to diffusion models. The analysis is a standard NTK analysis (though in Section 5.1 the authors explicitly explain the new challenges in their theoretical derivation). I also personally feel the paper would benefit greatly from more explanation, discussion, and comparisons. The current presentation is too notation-heavy, and lacks explanation of in

Taxonomy

Topics: Advanced Thermodynamics and Statistical Mechanics

Methods: Attention Is All You Need · Dense Connections · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Neural Tangent Kernel · Absolute Position Encodings