Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods

Junchi Yang; Xiang Li; Ilyas Fatkhullin; Niao He

arXiv:2305.12475 · math.OC · May 23, 2023 · 1 citation

TL;DR

This paper analyzes the limitations of untuned SGD and demonstrates how adaptive methods like AMSGrad and AdaGrad can overcome these issues by avoiding exponential dependence on smoothness constants, providing theoretical insights into their advantages.

Contribution

The paper proves that untuned SGD attains an order-optimal convergence rate but suffers from an exponential dependence on the smoothness constant, and shows that adaptive methods avoid this dependence.

Findings

01. Untuned SGD attains an order-optimal $O(T^{-1/4})$ convergence rate in gradient norm, but with an exponential dependence on the smoothness constant.

02. Adaptive methods (NSGD, AMSGrad, AdaGrad) prevent this exponential dependence without knowing the smoothness constant.

03. Adaptive methods outperform untuned SGD in handling large gradients.

Abstract

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta / \sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $O(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods (Normalized SGD (NSGD), AMSGrad, and AdaGrad), unveiling their power in preventing such exponential dependency in the absence of information about the…
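The contrast described in the abstract can be illustrated with a small noiseless sketch. It is not taken from the paper: the toy quadratic, the values `eta = 0.3` and curvatures `[25.0, 1.0]`, and the helper names are all assumptions chosen for illustration, and the adaptive method shown is a norm-based AdaGrad variant (a single scalar accumulator of squared gradient norms) rather than any specific algorithm analyzed by the authors. The sketch tracks the largest gradient norm seen along each trajectory.

```python
import math

def grad(x, curvatures):
    # Gradient of the toy objective f(x) = 0.5 * sum(c_i * x_i^2).
    return [c * xi for c, xi in zip(curvatures, x)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def run_untuned_sgd(x0, curvatures, eta, steps):
    # Deterministic "SGD" (no noise) with stepsize eta / sqrt(t), eta not tuned to L.
    x = list(x0)
    peak = norm(grad(x, curvatures))
    for t in range(1, steps + 1):
        g = grad(x, curvatures)
        step = eta / math.sqrt(t)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        peak = max(peak, norm(grad(x, curvatures)))
    return peak, norm(grad(x, curvatures))

def run_adagrad_norm(x0, curvatures, eta, steps, v0=1e-8):
    # Norm-based AdaGrad (illustrative variant): stepsize eta / sqrt(v0^2 + sum ||g||^2).
    x = list(x0)
    v_sq = v0 * v0
    peak = norm(grad(x, curvatures))
    for _ in range(steps):
        g = grad(x, curvatures)
        v_sq += norm(g) ** 2
        step = eta / math.sqrt(v_sq)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        peak = max(peak, norm(grad(x, curvatures)))
    return peak, norm(grad(x, curvatures))

# Hypothetical problem: smoothness constant L = 25, so eta * L = 7.5 > 2.
curvatures = [25.0, 1.0]
x0 = [1.0, 1.0]
eta = 0.3
steps = 2000

initial = norm(grad(x0, curvatures))
sgd_peak, sgd_final = run_untuned_sgd(x0, curvatures, eta, steps)
ada_peak, ada_final = run_adagrad_norm(x0, curvatures, eta, steps)

print(f"initial gradient norm: {initial:.3e}")
print(f"untuned SGD   peak: {sgd_peak:.3e}  final: {sgd_final:.3e}")
print(f"AdaGrad-norm  peak: {ada_peak:.3e}  final: {ada_final:.3e}")
```

In this toy setting, untuned SGD multiplies the stiff coordinate by $|1 - \eta L/\sqrt{t}| > 1$ for as long as $\eta L/\sqrt{t} > 2$, that is, for roughly $(\eta L/2)^2$ iterations, so its gradient norm grows by orders of magnitude before the decaying stepsize takes over. The norm-based adaptive step is scaled by the accumulated gradient norms and stays in the contractive regime from the first iteration. Increasing the hypothetical `curvatures[0]` lengthens the blow-up phase of untuned SGD accordingly.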

Videos

Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods (SlidesLive)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Image Processing Techniques

Methods: AMSGrad · AdaGrad · Stochastic Gradient Descent