Convergence of Adam in Deep ReLU Networks via Directional Complexity and Kakeya Bounds
Anupama Sridhar, Alexander Johansen

TL;DR
This paper provides the first theoretical convergence and generalization bounds for the Adam optimizer in deep ReLU networks, leveraging stratified Morse theory and Kakeya bounds to address non-smooth, non-convex landscapes.
Contribution
It introduces novel theoretical analysis techniques for Adam in non-smooth deep networks, including convergence to global optima and improved generalization bounds.
Findings
First \(\tilde{O}\bigl(\sqrt{d_{\mathrm{eff}}/n}\bigr)\) generalization bound for Adam in Deep ReLU networks.
Global-optimal convergence of Adam in non-smooth, non-convex ReLU landscapes without PL or convexity assumptions.
Number of region crossings reduces from exponential to near-linear in effective dimension.
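To make "region crossings" concrete, here is a minimal NumPy sketch, not the paper's construction. It trains a tiny one-hidden-layer ReLU network with the standard Adam update and counts the steps at which the hidden-layer activation pattern on the training inputs changes. That count is only a rough proxy chosen for illustration. The function names, network sizes, synthetic data, and hyperparameters are all assumptions and do not come from the paper.

```python
import numpy as np

def relu_pattern(W1, X):
    """Boolean hidden-layer activation pattern on inputs X (illustrative proxy for a 'region')."""
    return (X @ W1.T) > 0

def train_adam_count_crossings(steps=200, lr=1e-2, beta1=0.9, beta2=0.999,
                               eps=1e-8, seed=0):
    # Hypothetical toy setup: synthetic regression data, 4 inputs, 8 hidden units.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(32, 4))
    y = rng.normal(size=(32,))
    W1 = rng.normal(size=(8, 4)) * 0.5
    w2 = rng.normal(size=(8,)) * 0.5
    params = [W1, w2]
    m = [np.zeros_like(p) for p in params]  # first-moment estimates
    v = [np.zeros_like(p) for p in params]  # second-moment estimates

    n = X.shape[0]
    crossings = 0
    losses = []
    prev = relu_pattern(W1, X)
    for t in range(1, steps + 1):
        # Forward pass through the ReLU layer.
        pre = X @ W1.T
        act = pre > 0
        h = pre * act
        err = h @ w2 - y
        losses.append(0.5 * np.mean(err ** 2))

        # Gradients of the mean squared error (a subgradient at ReLU kinks).
        g_w2 = h.T @ err / n
        g_W1 = (np.outer(err, w2) * act).T @ X / n

        # Standard Adam update with bias correction, applied in place.
        for i, (p, g) in enumerate(zip(params, [g_W1, g_w2])):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            p -= lr * m_hat / (np.sqrt(v_hat) + eps)

        # Count a crossing whenever any unit flips sign on any training input.
        cur = relu_pattern(W1, X)
        crossings += int(np.any(cur != prev))
        prev = cur
    return crossings, losses

crossings, losses = train_adam_count_crossings()
print(f"steps with an activation-pattern change: {crossings} of {len(losses)}")
```

Counting per-step pattern changes undercounts when several boundaries are crossed in one step, so it should be read as a lower-bound style diagnostic rather than the quantity bounded in the paper.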
Abstract
First-order adaptive optimization methods like Adam are the default choices for training modern deep neural networks. Despite their empirical success, the theoretical understanding of these methods in non-smooth settings, particularly in Deep ReLU networks, remains limited. ReLU activations create exponentially many region boundaries where standard smoothness assumptions break down. We derive the first \(\tilde{O}\!\bigl(\sqrt{d_{\mathrm{eff}}/n}\bigr)\) generalization bound for Adam in Deep ReLU networks and the first global-optimal convergence for Adam in the non-smooth, non-convex ReLU landscape without a global PL or convexity assumption. Our analysis is based on stratified Morse theory and novel results in Kakeya sets. We develop a multi-layer refinement framework that progressively tightens bounds on region crossings. We prove that the number of region crossings collapses…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- The paper provides a global convergence theorem for Adam in deep ReLU networks under non-smooth, non-convex conditions, without NTK linearization or convexity assumptions.
- The paper presents a framework combining stratified Morse theory and Kakeya sets, which offers a fresh perspective on optimization dynamics.
- By leveraging assumptions (L1-L7), which are motivated by empirical observations, this paper reduces region-crossing complexity from $O(N^d)$ to $O(d_{\mathrm{eff}}\log N)$, co
- The derived result is quite abstract and dense. The manuscript would benefit greatly from more intuitive explanations, perhaps accompanied by simple diagrams or illustrative examples, to help the reader build a conceptual understanding of the core geometric arguments.
- While the paper mentions that assumptions (L1-L7) are "motivated by empirical findings" and provides brief citations, it lacks detailed, convincing explanations or specific empirical results to support these claims.
- The em
- The paper tackles a challenging and timely theoretical question about the convergence and generalization behavior of Adam in non-smooth deep learning settings.
- The technical development is ambitious, combining ideas from geometry, Morse theory, and optimization in an original way.
- The idea of handling non-smoothness through hyperplane-crossing analysis is neat and conceptually appealing.
- I believe the work could be of large interest to the optimization theory and deep learning theory com
While I believe the paper has the potential to be a strong contribution to our understanding of the Adam optimizer’s behavior, I have several concerns regarding the writing, rigor, and correctness of the work.
- I understand that the formulations need to be written densely to fit within the page limit. However, in several places, intermediate explanatory text is missing, which interrupts the reading flow and makes the paper difficult to follow. For example, there is no contextual text preceding
Original attempt at analyzing the generalization and optimization of ReLU networks trained with Adam. The idea is indeed novel and has the potential to bring about important results.
### 1. Absence of references for technical results used
This paper leverages many very technical existing results. However, they are almost always mentioned without appropriate references! A few examples:
- l679: "Goresky-MacPherson"
- l692: "Whitney regularity implies..."
- l816: "Zaslavsky's theorem..."
- l846: "Stratified Morse theory gives the bound..."
- l854: "Smith theory implies..."
- l1782: "Applying Freedman's inequality..."
- l1882: "Standard Adam descent lemma"

Moreover, a few ref
1. Studying the convergence of optimization algorithms on deep ReLU neural networks is a challenging and meaningful task.
2. The paper performs the analysis from the novel perspective of the regions separated by the ReLU decision boundary, and utilizes advanced mathematical tools such as Kakeya bounds and Whitney fans.
1. The paper makes too many assumptions for the analysis, most of them unrealistic. For instance, the paper requires a bounded norm of the layer weights and the stochastic gradient throughout training. This is something that should be established theoretically in a convergence proof. Moreover, the paper assumes L1-L7 together with an additional assumption of the uniform low-barrier property given in the appendix. Most of these assumptions are precisely the technical difficulty in th
Taxonomy
Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Face and Expression Recognition
Methods: Adam · Network On Network
