A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Shuo Xie; Tianhao Wang; Beining Wu; Zhiyuan Li

arXiv:2511.20584·cs.LG·November 26, 2025

Open Access · 3 Reviews

TL;DR

This paper explores the connection between adaptive optimizers and non-Euclidean geometries, extending adaptive smoothness theory to nonconvex settings and demonstrating accelerated convergence guarantees.

Contribution

It extends adaptive smoothness to nonconvex optimization, shows that it enables acceleration with Nesterov momentum in the convex setting, and establishes dimension-free convergence guarantees in the stochastic setting.

Findings

01 · Adaptive smoothness characterizes adaptive optimizer convergence.

02 · Adaptive smoothness enables acceleration with Nesterov momentum.

03 · Dimension-free convergence guarantees are achieved in stochastic settings.

Abstract

Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometries. We further develop an analogous comparison for stochastic optimization by…
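The abstract's opening claim, that an adaptive optimizer reduces to NSD when it adapts only to the current gradient, can be illustrated with a minimal sketch. It assumes the simplest case, which is not spelled out on this page: a diagonal AdaGrad-style preconditioner on the adaptive side and the $\ell_\infty$ norm on the NSD side, for which the normalized steepest descent direction is the coordinate-wise sign of the gradient. The function names and the example gradients are hypothetical and are not taken from the paper.

```python
import math


def adagrad_direction(grad_history):
    """Diagonal AdaGrad-style direction: g_t / sqrt(sum_s g_s^2), per coordinate.

    Coordinates whose accumulated squared gradient is zero get a zero step.
    (Illustrative assumption: no epsilon term, no learning rate.)
    """
    current = grad_history[-1]
    direction = []
    for i, g_i in enumerate(current):
        accum = sum(g[i] ** 2 for g in grad_history)
        direction.append(g_i / math.sqrt(accum) if accum > 0 else 0.0)
    return direction


def sign_direction(grad):
    """Normalized steepest descent direction under the l_inf norm: sign(g)."""
    return [float((g > 0) - (g < 0)) for g in grad]


# Hypothetical gradients, chosen only for illustration.
g_prev = [1.0, 4.0, -2.0, 0.5]
g_now = [3.0, -0.5, 0.0, 2.0]

# Adapting only to the current gradient: g / sqrt(g^2) collapses to sign(g).
only_current = adagrad_direction([g_now])
nsd = sign_direction(g_now)

# Adapting to the full history: coordinates are rescaled, not merely signed.
with_history = adagrad_direction([g_prev, g_now])
```

With a single gradient in the history, `only_current` and `nsd` coincide. Once earlier gradients enter the accumulator, `with_history` shrinks each coordinate by a different factor, and this is the regime the two analyses treat differently.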

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- By identifying the two distinct smoothness notions and linking the stronger "adaptive smoothness" to the ability to achieve acceleration (which is provably impossible for the weaker, standard smoothness), the paper provides a clear separation.
- The authors' parallel "adaptive variance" argument is well-supported, as they provide not just a pessimistic, dimension-dependent upper bound for standard variance (Prop D.4) but also a corresponding lower bound (Thm 4.7) to prove that this $d$-depende

Weaknesses

- I am not fully clear about the distinction between adaptive methods and NSD, as the author claims they exploit different notions of smoothness. For instance, under the adaptive smoothness assumption, NSD could potentially also achieve the same rate as in Theorem 3.3, and when combined with the acceleration technique in Eq. (4), could possibly also attain an $O(1/T^2)$ rate for convex functions. Conversely, adaptive methods could also match the rate of NSD under $L_{\|\cdot\|_{\mathcal{H}}}

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper is well-written and contributes to an important problem, namely understanding the improved performance of preconditioning based optimizers.
2. There are a lot of theoretical results, and from what I can tell, these results are not simple repetitions of known ideas. In particular, the results in the non-convex setting rely on matrix inequalities that involve a good bit of technical work. I checked the proofs of these matrix inequalities (Lemma 3.4 and all lemmas it relies on), and th

Weaknesses

1. There are several examples of exaggerated or inaccurate language related to the contribution of this work. I feel that the paper would be stronger if the writing was more direct, transparent, and objective about the contributions and the relationship to previous work. Below are some examples.
   1a. The paragraph on lines 64-71 seems to misrepresent the authors' contribution. The paragraph states that the contribution is to show the accelerated $1/T^2$ rate for preconditioned algorithms, bu

Reviewer 03 · Rating 4 · Confidence 2

Strengths

- Two genuinely different smoothness/variance regimes with algorithmic consequences. The paper formalizes why “adaptive optimizers is approximately NSD with a norm” is not the full story: adaptive methods converge under adaptive smoothness $\Lambda_{\mathcal H}(f)$, which is always $\ge$ the standard smoothness used by NSD (Prop. 2.3). Under adaptive smoothness, accelerated $O(\tilde T^{-2})$ convergence with Nesterov momentum is achieved (Thm. 4.4), while standard $\ell_\infty$-smoothness faces

Weaknesses

- The main guarantees hinge on adaptive quantities—$\Lambda_{\mathcal H}(f)$ (minimizing trace bounds over a preconditioner set) and $\sigma_{\mathcal H}(f)$ (sup over $x,t$ with minimization over $H\in\mathcal H$). These are hard to estimate or upper-bound in realistic deep-learning settings. The paper discusses relationships to bounded covariance (e.g., via $\mathrm{Tr}(P_{\mathcal H}(\Sigma))$) but stops short of estimators/diagnostics practitioners could compute to check assumptions or guide

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Sparse and Compressive Sensing Techniques