Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate
Zhiqi Bu, Shiyun Xu, Jialin Mao

TL;DR
This paper explores how deep learning loss landscapes exhibit convex-like properties, enabling the derivation of scaling laws for loss and learning rate that improve training efficiency across various models and tasks.
Contribution
It introduces a framework to analyze deep learning dynamics through convexity and Lipschitz continuity, leading to new scaling laws for loss and learning rate extrapolation.
Findings
Deep learning loss becomes weakly convex after a short initial training phase.
Loss can be predicted by an upper bound on the last iterate.
Scaling laws of loss and learning rate extrapolate as much as 80X across training horizons and 70X across model sizes.
Abstract
Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via the learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80X across training horizons and 70X across model sizes.
Peer Reviews
Decision: ICLR 2026 Poster
- **Broad, consistent empirical signal.** The sequence-to-sequence fits trained on the first half of iterations and evaluated on the second half show high $R^2$ across vision (ResNet, ViT) and language (GPT-2), optimizers (SGD, AdamW, Muon-NSGD), and LR schedules.
- **Simple, actionable guidance.** Horizon-aware schedules with $\eta_{\text{peak}} \propto 1/\sqrt{T}$, paired with a few pilot runs to pick $\eta_{\text{ref}}^\star$, provide a practical recipe for early forecasting and planning.
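The horizon-aware recipe described in this review can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the pilot-run dictionary, and every numeric value are hypothetical, and the sketch assumes the $\eta_{\text{peak}} \propto 1/\sqrt{T}$ rule holds exactly between the reference horizon and the target horizon.

```python
import math


def pick_reference_lr(pilot_losses: dict) -> float:
    """Return the peak LR with the lowest final loss among the pilot runs.

    `pilot_losses` maps a candidate peak LR to the final loss observed
    in a short pilot run (hypothetical data layout).
    """
    if not pilot_losses:
        raise ValueError("need at least one pilot run")
    return min(pilot_losses, key=pilot_losses.get)


def scale_peak_lr(eta_ref: float, t_ref: int, t_new: int) -> float:
    """Transfer a tuned peak LR to a new horizon, assuming eta ∝ 1/sqrt(T)."""
    if eta_ref <= 0 or t_ref <= 0 or t_new <= 0:
        raise ValueError("learning rate and horizons must be positive")
    return eta_ref * math.sqrt(t_ref / t_new)


# Made-up pilot results at a reference horizon of 1,000 steps.
pilots = {1e-4: 3.2, 3e-4: 2.9, 1e-3: 3.1}
eta_ref = pick_reference_lr(pilots)

# Extrapolate to a 16x longer run: the peak LR shrinks by sqrt(16) = 4.
eta_long = scale_peak_lr(eta_ref, t_ref=1_000, t_new=16_000)
print(f"reference peak LR: {eta_ref:g}, scaled peak LR: {eta_long:g}")
```

Selecting the reference learning rate by the lowest pilot loss is only one plausible choice; a finer sweep or a fitted loss curve could replace it without changing the scaling step.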
- **Limited theoretical novelty.** Main results rely on standard convex SGD analysis; $\mathcal{O}(1/\sqrt{T})$ rates are classical. The main theoretical additions are schedule-wise constants (some via approximation).
- **Reproducibility and robustness gaps.** No code/logs are released. Experiments primarily vary LR and $T$; other key hyperparameters (weight decay, momentum/$\beta$’s, clipping, batch size) are mostly fixed (e.g., wd$=0.01$) with no ablations.
- **Scope of applicability.** Appen
I believe it is rare to find a nice, predictable tendency in neural network training, and the discovery in this paper is useful because, once we can predict what will work best in practice, we can, e.g., compute the optimal learning rate by hand. Even though this is an extension (at least in my view), the fact that the tendency exists for different optimizers is an important discovery, because it is not obvious. So one strength of this paper is that it proposes a property of neural network trainin
One thing that was not clear to me: we have an upper bound of the form $$ L(w_T) - L^{*} \leq \frac{A}{T\eta_{max}} + B\eta_{max}. $$ Are $A$ and $B$ functions of the optimizer, the model, and the learning rate schedule? In other words, if these three are fixed, are $A$ and $B$ fixed regardless of the maximal learning rate? If this is not the case, I think it would make the paper much weaker (and I may lower the score), because even though there is a tendency th
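To make the reviewer's question concrete: if $A$ and $B$ really are constants once the optimizer, model, and schedule are fixed, then the right-hand side of the bound is minimized at $\eta_{max}^\star = \sqrt{A/(BT)}$ with value $2\sqrt{AB/T}$, which is where a $1/\sqrt{T}$ learning rate scaling would come from. The sketch below fits $A$ and $B$ by least squares and evaluates that optimum. It is an illustration only, not the paper's procedure: it treats the bound as if it held with equality, assumes $L^{*}$ is known so that excess losses can be formed, and uses hypothetical function names and synthetic data.

```python
import math


def fit_bound_constants(etas, horizons, excess_losses):
    """Least-squares fit of y ≈ A / (T * eta) + B * eta (no intercept).

    Solves the 2x2 normal equations by hand; inputs are equal-length
    sequences of peak LRs, horizons T, and excess losses L(w_T) - L*.
    """
    u = [1.0 / (t * e) for e, t in zip(etas, horizons)]  # feature for A
    v = [float(e) for e in etas]                         # feature for B
    suu = sum(x * x for x in u)
    svv = sum(x * x for x in v)
    suv = sum(x * y for x, y in zip(u, v))
    suy = sum(x * y for x, y in zip(u, excess_losses))
    svy = sum(x * y for x, y in zip(v, excess_losses))
    det = suu * svv - suv * suv
    if det == 0:
        raise ValueError("runs do not identify A and B separately")
    a = (suy * svv - svy * suv) / det
    b = (svy * suu - suy * suv) / det
    return a, b


def optimal_peak_lr(a, b, horizon):
    """Minimizer of A / (T * eta) + B * eta over eta > 0."""
    return math.sqrt(a / (b * horizon))


def bound_at_optimum(a, b, horizon):
    """Value of the bound at the minimizing learning rate."""
    return 2.0 * math.sqrt(a * b / horizon)


# Synthetic, noise-free runs generated with made-up constants A=2, B=50.
etas = [1e-3, 3e-3, 1e-2, 1e-3, 3e-3, 1e-2]
horizons = [1_000, 1_000, 1_000, 4_000, 4_000, 4_000]
losses = [2.0 / (t * e) + 50.0 * e for e, t in zip(etas, horizons)]

a_hat, b_hat = fit_bound_constants(etas, horizons, losses)
eta_star = optimal_peak_lr(a_hat, b_hat, horizon=10_000)
print(f"A={a_hat:.3f}, B={b_hat:.3f}, eta*={eta_star:.2e}")
```

If the fitted $A$ and $B$ drift noticeably when the fit is repeated on runs with different peak learning rates, that would be evidence for the dependence the reviewer is worried about.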
The paper's strengths are:
- The paper’s motivation of predicting convex-like behavior is relevant to understanding the training of deep learning models.
- The scaling laws found by the paper, which establish evidence of convex-like behavior in the training of deep models, are extensively characterized in the paper's data-driven approach.
- The paper has enough models and training procedures to properly demonstrate its claims. I really appreciate that from the authors.
Although the paper’s topic is relevant, multiple things were unclear in the paper’s presentation (some of them could have been spotted with diligent proofreading). There were also confusing parts in the presentation of both theoretical and experimental results; I will detail this below. Given that there is still work to be done to improve the paper, I am giving the current score.
**>>Important things:**
- I find it odd that in the last sentence of the paragraph from line 057, t
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Advanced Neural Network Applications
