Statistical Guarantees for Approximate Stationary Points of Shallow Neural Networks
Mahsa Taheri, Fang Xie, Johannes Lederer

TL;DR
This paper provides statistical guarantees for shallow neural networks at stationary points, bridging the gap between theoretical global optima and practical local solutions, applicable to both linear and ReLU networks.
Contribution
It introduces statistical guarantees for stationary points of shallow neural networks, extending theoretical understanding closer to real-world neural network training.
Findings
Guarantees apply to stationary points and nearby points.
Results match global optima up to logarithmic factors.
Extends to shallow ReLU networks, assuming the first-layer weight matrices of the stationary network and the target are nearly identical.
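To make the notion of an approximate stationary point concrete, here is a minimal sketch in Python with NumPy. It assumes an unregularized mean squared loss, a shallow linear network x -> W2 W1 x, and plain gradient descent; none of these choices is taken from the paper, whose objective and estimator may differ. The function names (`loss_and_grads`, `grad_norm`, `train`) and all hyperparameters are hypothetical and chosen only for illustration. The norm of the gradient serves as the stationarity measure: a point counts as approximately stationary when this norm is small, whether or not it is a global optimum.

```python
import numpy as np

def loss_and_grads(W1, W2, X, Y):
    """Mean squared loss of the shallow linear network x -> W2 @ W1 @ x,
    together with its gradients with respect to W1 and W2.
    (Illustrative objective; an assumption, not the paper's.)"""
    n = X.shape[0]
    H = X @ W1.T                 # hidden representation, shape (n, hidden)
    R = H @ W2.T - Y             # residuals, shape (n, outputs)
    loss = 0.5 * np.sum(R ** 2) / n
    gW2 = R.T @ H / n            # gradient w.r.t. W2, shape (outputs, hidden)
    gW1 = (R @ W2).T @ X / n     # gradient w.r.t. W1, shape (hidden, inputs)
    return loss, gW1, gW2

def grad_norm(gW1, gW2):
    """Euclidean norm of the full gradient, used as the stationarity measure."""
    return float(np.sqrt(np.sum(gW1 ** 2) + np.sum(gW2 ** 2)))

def train(X, Y, hidden=4, steps=2000, lr=0.05, seed=0):
    """Plain gradient descent from a small random initialization
    (hypothetical hyperparameters)."""
    rng = np.random.default_rng(seed)
    d, k = X.shape[1], Y.shape[1]
    W1 = 0.1 * rng.standard_normal((hidden, d))
    W2 = 0.1 * rng.standard_normal((k, hidden))
    for _ in range(steps):
        _, gW1, gW2 = loss_and_grads(W1, W2, X, Y)
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2
```

In use, one would train on a dataset, call `loss_and_grads` at the returned weights, and report `grad_norm` next to the loss. A small gradient norm indicates an approximate stationary point, which is the kind of output the guarantees summarized above are stated for.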
Abstract
Since statistical guarantees for neural networks are usually restricted to global optima of intricate objective functions, it is unclear whether these theories explain the performances of actual outputs of neural network pipelines. The goal of this paper is, therefore, to bring statistical theory closer to practice. We develop statistical guarantees for shallow linear neural networks that coincide up to logarithmic factors with the global optima but apply to stationary points and the points nearby. These results support the common notion that neural networks do not necessarily need to be optimized globally from a mathematical perspective. We then extend our statistical guarantees to shallow ReLU neural networks, assuming the first layer weight matrices are nearly identical for the stationary network and the target. More generally, despite being limited to shallow neural networks for…
Peer Reviews
Decision: Submitted to ICLR 2025
- The paper studies whether finding stationary points is sufficient for good performance. This is an interesting problem.
- The paper is written clearly, especially in the appendix where each step in the proofs is described well.
1. _Lack of emphasis on the boundedness assumption_: The paper uses the term reasonable stationary points in its theorem statements to indicate that the weight matrices have bounded norm. The boundedness is used repeatedly in the proofs and is the main ingredient needed for the proofs. The term "reasonable stationary point" leads the reader to believe that stationarity is an important component when only the boundedness is. The paper can be improved if the authors are clear about this fact. 2. _
This paper focuses on approximate stationary points instead of global minima, which is more relevant to practical deep learning settings. Generalizing the result to ReLU networks enhances its significance. The rigorous mathematical proofs make the result quite solid.
1. Some of the expressions in mathematical statements are a bit ambiguous. For example, "$\approx$" is used in Assumption 2 and Theorem 3. I think using such notation in the explanation part is fine, but in mathematical statements it should be more rigorous. I can't even find the rigorous expressions in the appendix. Besides, "the second and third parts of Assumption 1 and Assumption 2" in Theorem 3 is not a precise sentence and may cause ambiguity. 2. The assumption of the part of ReLU nets see
**Clarity:** The paper is generally well-organized, with a clear progression from theoretical background to the main results and implications. Each section is introduced with clear motivations, and notations are consistently defined. **Significance:** This work provides crucial theoretical support for the practical success of neural networks, showing that networks trained to stationary points can perform nearly as well as those trained to global minima. However, it is important to note that the
**Empirical Limitations:** The empirical results are confined to small-scale simulations and toy models, which may limit the perceived robustness of the theoretical findings. The paper would benefit from experiments on more complex datasets. **Assumptions for Shallow ReLU Networks:** The assumption of a near-identity first-layer weight matrix for ReLU networks may restrict the practical relevance of Theorem 3. While the authors acknowledge this limitation, a deeper discussion of how these res
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques
