A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Shuo Xie, Tianhao Wang, Beining Wu, Zhiyuan Li

TL;DR
This paper explores the connection between adaptive optimizers and non-Euclidean geometries, extending adaptive smoothness theory to nonconvex settings and demonstrating accelerated convergence guarantees.
Contribution
It introduces adaptive smoothness for nonconvex optimization and shows how it enables acceleration and dimension-free convergence in non-Euclidean geometries.
Findings
Adaptive smoothness characterizes adaptive optimizer convergence.
Adaptive smoothness enables acceleration with Nesterov momentum.
Dimension-free convergence guarantees are achieved in stochastic settings.
Abstract
Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometries. We further develop an analogous comparison for stochastic optimization by…
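The abstract's opening claim, that an adaptive optimizer adapting only to the current gradient reduces to NSD, can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a diagonal, Adam-style update with `beta1 = beta2 = 0` and `eps = 0`, and compares it with sign descent, which is normalized steepest descent with respect to the $\ell_\infty$ norm. The function names and the sample gradient are hypothetical.

```python
import math

def adam_step_no_memory(grad, lr):
    """Adam-style update in the assumed limit beta1 = beta2 = 0, eps = 0.

    With no memory, the first moment is the current gradient and the
    second moment is its elementwise square, so each coordinate moves
    by -lr * g / sqrt(g^2).
    """
    update = []
    for g in grad:
        m = g        # first-moment estimate: current gradient only
        v = g * g    # second-moment estimate: current squared gradient only
        direction = m / math.sqrt(v) if v > 0 else 0.0  # zero gradient: no move
        update.append(-lr * direction)
    return update

def sign_descent_step(grad, lr):
    """Normalized steepest descent w.r.t. the l_inf norm: -lr * sign(g)."""
    return [-lr * ((g > 0) - (g < 0)) for g in grad]

# Hypothetical gradient, chosen only for illustration.
grad = [0.5, -2.0, 0.0, 3.0]
adam_update = adam_step_no_memory(grad, lr=0.1)
nsd_update = sign_descent_step(grad, lr=0.1)

# The two updates coincide coordinate by coordinate.
print(all(math.isclose(a, b, abs_tol=1e-12)
          for a, b in zip(adam_update, nsd_update)))
```

Once the optimizer accumulates gradient history (nonzero `beta2`, or an AdaGrad-style running sum), the two updates no longer coincide. That is the regime in which the abstract's distinction between adaptive smoothness and standard smoothness applies.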
Peer Reviews
Decision: ICLR 2026 Poster
- By identifying the two distinct smoothness notions and linking the stronger "adaptive smoothness" to the ability to achieve acceleration (which is provably impossible for the weaker, standard smoothness), the paper provides a clear separation.
- The authors' parallel "adaptive variance" argument is well-supported, as they provide not just a pessimistic, dimension-dependent upper bound for standard variance (Prop D.4) but also a corresponding lower bound (Thm 4.7) to prove that this $d$-depende
- I am not fully clear about the distinction between adaptive methods and NSD, as the authors claim they exploit different notions of smoothness. For instance, under the adaptive smoothness assumption, NSD could potentially also achieve the same rate as in Theorem 3.3, and when combined with the acceleration technique in Eq. (4), could possibly also attain an $O(1/T^2)$ rate for convex functions. Conversely, adaptive methods could also match the rate of NSD under $L_{\|\cdot\|_{\mathcal{H}}}
1. The paper is well-written and contributes to an important problem, namely understanding the improved performance of preconditioning based optimizers.
2. There are a lot of theoretical results, and from what I can tell, these results are not simple repetitions of known ideas. In particular, the results in the non-convex setting rely on matrix inequalities that involve a good bit of technical work. I checked the proofs of these matrix inequalities (Lemma 3.4 and all lemmas it relies on), and th
1. There are several examples of exaggerated or inaccurate language related to the contribution of this work. I feel that the paper would be stronger if the writing was more direct, transparent, and objective about the contributions and the relationship to previous work. Below are some examples.
1a. The paragraph on lines 64-71 seems to misrepresent the authors' contribution. The paragraph states that the contribution is to show the accelerated $1/T^2$ rate for preconditioned algorithms, bu
- Two genuinely different smoothness/variance regimes with algorithmic consequences. The paper formalizes why “adaptive optimizers are approximately NSD with a norm” is not the full story: adaptive methods converge under adaptive smoothness $\Lambda_{\mathcal H}(f)$, which is always $\ge$ the standard smoothness used by NSD (Prop. 2.3). Under adaptive smoothness, accelerated $O(\tilde T^{-2})$ convergence with Nesterov momentum is achieved (Thm. 4.4), while standard $\ell_\infty$-smoothness faces
- The main guarantees hinge on adaptive quantities—$\Lambda_{\mathcal H}(f)$ (minimizing trace bounds over a preconditioner set) and $\sigma_{\mathcal H}(f)$ (sup over $x,t$ with minimization over $H\in\mathcal H$). These are hard to estimate or upper-bound in realistic deep-learning settings. The paper discusses relationships to bounded covariance (e.g., via $\mathrm{Tr}(P_{\mathcal H}(\Sigma))$) but stops short of estimators/diagnostics practitioners could compute to check assumptions or guide
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Sparse and Compressive Sensing Techniques
