Reevaluating Theoretical Analysis Methods for Optimization in Deep Learning
Hoang Tran, Qinzi Zhang, Ashok Cutkosky

TL;DR
This paper critically examines the gap between theoretical analysis and practical performance of optimization algorithms in deep learning, introducing empirical metrics to evaluate the validity of theoretical assumptions.
Contribution
It develops new empirical metrics to compare real optimization behavior with theoretical predictions, revealing limitations of smoothness-based analyses in practice.
Findings
Smoothness assumptions often fail in practice
Key identities in convex analysis frequently hold despite non-convexity
Empirical metrics reveal discrepancies between theory and practice
Abstract
There is a significant gap between our theoretical understanding of optimization algorithms used in deep learning and their practical performance. Theoretical development usually focuses on proving convergence guarantees under a variety of different assumptions, which are themselves often chosen based on a rough combination of intuitive match to practice and analytical convenience. In this paper, we carefully measure the degree to which the standard optimization analyses are capable of explaining modern algorithms. To do this, we develop new empirical metrics that compare real optimization behavior with analytically predicted behavior. Our investigation is notable for its tight integration with modern optimization analysis: rather than simply checking high-level assumptions made in the analysis (e.g. smoothness), we also verify key low-level identities used by the analysis to explain…
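As a rough illustration of what checking smoothness along the optimization path can look like, the sketch below tracks a per-step ratio of gradient change to parameter change during gradient descent. The toy quadratic objective, the learning rate, and the names `gradient`, `instantaneous_smoothness`, and `run_gd` are assumptions made for this example. They are not taken from the paper, whose exact metric definitions are not reproduced here.

```python
import math

# Hypothetical toy objective f(x) = 0.5 * sum(a_i * x_i^2); not from the paper.
CURVATURES = [1.0, 10.0]


def gradient(x):
    """Gradient of the toy quadratic at x."""
    return [a * xi for a, xi in zip(CURVATURES, x)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def instantaneous_smoothness(x_prev, x_next):
    """Assumed metric: ||g(x_next) - g(x_prev)|| / ||x_next - x_prev||."""
    g_prev, g_next = gradient(x_prev), gradient(x_next)
    step = [b - a for a, b in zip(x_prev, x_next)]
    grad_change = [b - a for a, b in zip(g_prev, g_next)]
    return norm(grad_change) / norm(step)


def run_gd(x0, lr, num_steps):
    """Run plain gradient descent and record the ratio at every step."""
    x = list(x0)
    ratios = []
    for _ in range(num_steps):
        g = gradient(x)
        x_next = [xi - lr * gi for xi, gi in zip(x, g)]
        ratios.append(instantaneous_smoothness(x, x_next))
        x = x_next
    return ratios


ratios = run_gd([1.0, 1.0], lr=0.05, num_steps=20)
```

On this quadratic every ratio lies between the smallest and largest curvature (1 and 10), so a global smoothness constant of 10 is never violated. On a real network the same measurement would need gradients from an autodiff framework, and the recorded ratios need not stay below any fixed constant.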
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The organization and presentation of this paper are smooth and clear, and it provides new insights when comparing the modern training algorithms SGD and AdamW. 2. This paper illustrates the necessity of developing a new theoretical framework for analyzing optimization algorithms, which is crucial in the deep learning era.
In case I miss something, please correct me if I am wrong. My biggest concern is that the main findings in this paper are not that surprising, i.e., it is expected that in practical training the loss landscape is highly non-convex, and thus it is reasonable that convexity or smoothness does not hold along the whole optimization path. From a high-level point of view, this paper seems to reveal the gap between theory and practice, but this gap has been around since the days of deep learni
- The primary goal of the paper – verifying the inequalities used in optimization theory – is clear and well motivated. - The paper is overall well written. The paper connects high level assumptions to specific inequalities, and each section describes how the corresponding inequality is used in standard analyses and how the paper will attempt to verify them. - The overall results, while not entirely surprising, strongly support the claim that existing analyses are unable to explain convergence i
- [1] proposed "directional smoothness" which is the same as this paper's notion of instantaneous smoothness. They also observed that this quantity can approximate the sharpness. - The experiments and conclusion in section 4.1 are essentially equivalent to those in Cohen et al. 2020;2022. - While not exactly equal to the update correlation (eq. 9), Cohen et al. 2020 Appendix H showed that for SGD, the loss does not decrease in expectation which carries a similar message to section 4.2. - Appendi
- Overall, the main takeaways of this paper seem to be interesting and systematically validated. While some of them may not be entirely new to the readers, the paper effectively connects the dots to provide a clearer overall picture. - The quantities tracked in this paper seem to be well-designed owing to a good abstraction of existing proof techniques.
- Several possible alternatives discussed by the authors for smooth non-convex optimization do not seem to contribute much to the "theoretical understanding of optimization algorithms used in deep learning": a) Weak convexity still might not hold globally; b) The algorithms for non-convex non-smooth optimization (e.g. [1, 2]) might be too complicated to be practical; c) The variants without random scaling of existing algorithms seem to perform on par with the variants with random scaling. [1] Z
- the findings in section 3.2 imply that a popular convexity-based analysis technique cannot apply in deep learning (since the so-called 'convexity ratio' is negative in many deep learning settings) - in section 4, it is shown that a certain measure of smoothness (the relative change in the gradients after each step) is large for deep learning landscapes. This implies that smoothness-based analyses based on this property do not apply in deep learning. This is a more computationally tractable m
- the quantity measured in section 3.1 (which measures whether the loss is convex along the line between successive iterates) does not really enter into convergence analyses. Thus, it's not clear what is the significance of these results for convergence analyses. - the paper's findings regarding smoothness (e.g. that it adapts to the learning rate) are similar to results which have already been reported in prior works, in particular Cohen et al '21. Prior works such as Xing et al '18 have alre
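For readers unfamiliar with the idea of checking convexity along the line between successive iterates, here is a minimal sketch of one possible test: sample points on the segment and compare the loss with the chord joining the endpoints. The toy loss and the names `loss`, `chord_gap`, and `looks_convex_on_segment` are assumptions for illustration only, and this chord test is not necessarily the quantity the paper measures in section 3.1.

```python
def loss(x):
    """Hypothetical non-convex toy loss: sum of x_i^4 - 2 * x_i^2."""
    return sum(xi ** 4 - 2.0 * xi ** 2 for xi in x)


def chord_gap(x_a, x_b, s):
    """Chord value minus loss at the point a fraction s of the way from x_a to x_b.

    A non-negative gap is what convexity on the segment requires.
    """
    point = [(1.0 - s) * a + s * b for a, b in zip(x_a, x_b)]
    return (1.0 - s) * loss(x_a) + s * loss(x_b) - loss(point)


def looks_convex_on_segment(x_a, x_b, num_samples=9, tol=1e-12):
    """Return True if no sampled interior point lies above the chord.

    Sampling gives only a necessary check, not a proof of convexity.
    """
    fractions = [k / (num_samples + 1) for k in range(1, num_samples + 1)]
    return all(chord_gap(x_a, x_b, s) >= -tol for s in fractions)


convex_far_from_origin = looks_convex_on_segment([1.0], [2.0])
convex_near_origin = looks_convex_on_segment([-0.5], [0.5])
```

In this toy case the segment from 1.0 to 2.0 passes the check while the segment from -0.5 to 0.5 fails it, showing that a non-convex loss can still look convex along individual steps.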
Taxonomy
Topics: Machine Learning and Data Classification
