Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
Dmitry Kovalev

TL;DR
This paper introduces a class of generalized Lipschitz functions for non-smooth convex optimization with unbounded gradients, and shows that, in theory, AdamW with clipped updates achieves better convergence rates than other popular stochastic methods such as SGD and AdaGrad in this setting.
Contribution
The paper develops a convergence analysis of AdamW with clipped updates, showing that it outperforms other methods on generalized Lipschitz convex problems and that the same algorithm also applies to smooth convex problems.
Findings
Clipped AdamW outperforms SGD and AdaGrad in generalized Lipschitz convex optimization.
Exponential gradient weighting in AdamW is crucial for its convergence.
Clipped AdamW achieves improved rates under generalized smoothness and quasar-convexity.
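To make the findings above concrete, here is a minimal sketch of a single AdamW step whose update vector is clipped in norm. It is not the paper's algorithm: the bias correction, the choice of Euclidean norm clipping, every hyperparameter value, and the test objective f(x) = exp(|x|) - 1 are assumptions made for illustration. The objective is chosen because its gradient grows without bound while still satisfying |f'(x)| = 1 + f(x). The buffers `m` and `v` are exponentially weighted accumulators, in contrast to the plain running sum used by AdaGrad.

```python
import math

def f(x):
    # Hypothetical 1-D test objective: f(x) = exp(|x|) - 1, minimized at x = 0.
    return math.exp(abs(x)) - 1.0

def subgrad(x):
    # A subgradient of f; 0 is chosen at the kink x = 0.
    if x == 0.0:
        return 0.0
    return math.copysign(math.exp(abs(x)), x)

def clipped_adamw_step(x, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
                       eps=1e-8, weight_decay=0.0, clip=1.0):
    """One textbook AdamW step with the update clipped to norm `clip`.

    x, g, m, v are lists of floats; t is the 1-based step counter.
    The clipping rule is an assumption, not taken from the paper.
    """
    # Exponentially weighted accumulation of gradients and squared gradients.
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Standard Adam bias correction.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    update = [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
    # Rescale the update so that its Euclidean norm is at most `clip`.
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip / norm) if norm > 0.0 else 1.0
    # Decoupled weight decay, as in AdamW.
    x_new = [xi - lr * (scale * ui + weight_decay * xi)
             for xi, ui in zip(x, update)]
    return x_new, m, v

def run(x0, steps):
    # Run the sketch on the 1-D test objective and return the final iterate.
    x, m, v = [x0], [0.0], [0.0]
    for t in range(1, steps + 1):
        x, m, v = clipped_adamw_step(x, [subgrad(x[0])], m, v, t)
    return x[0]
```

With `weight_decay=0.0`, each step moves the iterate by at most `lr * clip` in norm, no matter how large the raw gradient is. Calling `run(3.0, 200)` starts where the gradient is roughly 20 and moves the iterate toward the minimizer.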
Abstract
Much of the existing theory on first-order non-smooth optimization is built on the restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW…
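One way to write the generalized Lipschitz condition described in the abstract is shown below. The constants $L_0, L_1 \ge 0$, the minimum value $f^*$, and the choice of norm are notation assumed here for illustration; the paper's exact statement may differ.

```latex
\|\nabla f(x)\| \;\le\; L_0 + L_1 \bigl(f(x) - f^*\bigr) \quad \text{for all } x,
\qquad f^* = \min_x f(x).
```

Setting $L_1 = 0$ recovers the classical bounded-gradient assumption $\|\nabla f(x)\| \le L_0$. As a simple illustration, $f(x) = e^{|x|} - 1$ on the real line has $f^* = 0$ and $|f'(x)| = e^{|x|} = 1 + (f(x) - f^*)$ for $x \neq 0$, so it satisfies the condition with $L_0 = L_1 = 1$ even though its gradients are unbounded.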
