Second-order Information in First-order Optimization Methods

Yuzheng Hu; Licong Lin; Shange Tang

arXiv:1912.09926·cs.LG·December 23, 2019·1 cites

TL;DR

This paper reveals the second-order nature of first-order optimization methods, introduces a new algorithm AdaSqrt that challenges the necessity of the square root in adaptive methods, and demonstrates competitive performance on standard datasets.

Contribution

It uncovers the second-order essence of first-order methods, relates adaptive methods to Natural Gradient Descent, and proposes AdaSqrt, a novel algorithm removing the square root in updates.

Findings

01. AdaSqrt performs comparably to SGD and Adam on MNIST.

02. AdaSqrt outperforms Adam on CIFAR-10.

03. Removing the square root does not impair training performance.
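
The square-root removal can be made concrete with a small NumPy sketch. It is illustrative only and is not the authors' implementation: it assumes an Adagrad-style running sum of squared gradients, reads the sqrt(T) learning-rate scaling mentioned in the abstract as scaling by the square root of the current step count t, and uses hypothetical names (`adagrad_step`, `adasqrt_step`, `run`) and a toy quadratic objective.

```python
import numpy as np


def adagrad_step(x, g, G, lr, eps=1e-8):
    """Standard Adagrad: divide by the square root of the summed squared gradients."""
    G = G + g * g
    return x - lr * g / (np.sqrt(G) + eps), G


def adasqrt_step(x, g, G, t, lr, eps=1e-8):
    """ASSUMED AdaSqrt-style step: no square root in the denominator,
    learning rate scaled by sqrt(t). The paper's exact rule may differ."""
    G = G + g * g
    return x - lr * np.sqrt(t) * g / (G + eps), G


def run(method, grad, x0, steps, lr):
    """Run `steps` iterations of the chosen update from x0 and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        if method == "adasqrt":
            x, G = adasqrt_step(x, g, G, t, lr)
        else:
            x, G = adagrad_step(x, g, G, lr)
    return x


# Toy objective (hypothetical): an ill-conditioned quadratic 0.5 * sum(c_i * x_i^2).
CURVATURE = np.array([1.0, 10.0])


def loss(x):
    return 0.5 * float(np.sum(CURVATURE * x * x))


def grad(x):
    return CURVATURE * x


if __name__ == "__main__":
    x0 = [1.0, 1.0]
    for method in ("adagrad", "adasqrt"):
        x_final = run(method, grad, x0, steps=100, lr=0.1)
        print(method, loss(np.asarray(x0)), "->", loss(x_final))
```

On this toy problem both updates reduce the loss, but the example says nothing about MNIST or CIFAR-10 behavior; it only shows where the square root and the step-count scaling enter the update under the stated assumptions.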

Abstract

In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients, and thereby approximates the Hessian and accelerates training. For adaptive methods, we relate Adam and Adagrad to a powerful technique in computational statistics: Natural Gradient Descent (NGD). These adaptive methods can in fact be treated as relaxations of NGD, with only a slight difference lying in the square root of the denominator in the update rules. Skeptical about the effect of this difference, we design a new algorithm, AdaSqrt, which removes the square root in the denominator and scales the learning rate by sqrt(T). Surprisingly, our new algorithm is comparable to various first-order methods (such as SGD and Adam) on MNIST and even beats Adam on…
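
The gradient-difference claim for Nesterov Accelerated Gradient can be illustrated with a short derivation. This is a sketch under one common two-sequence formulation of NAG, which may differ from the paper's own notation and proof; the step size $\eta$, momentum coefficient $\mu$, and iterates $x_t$, $y_t$ are assumed symbols that do not appear in the text above.

```latex
% Assumed two-sequence form of NAG (all symbols are illustrative):
\begin{align*}
x_{t+1} &= y_t - \eta \nabla f(y_t), \\
y_{t+1} &= x_{t+1} + \mu\,(x_{t+1} - x_t).
\end{align*}
% For t >= 1, substitute
%   x_{t+1} - x_t = (y_t - y_{t-1}) - \eta (\nabla f(y_t) - \nabla f(y_{t-1})):
\begin{align*}
y_{t+1} &= y_t - \eta \nabla f(y_t) + \mu\,(y_t - y_{t-1})
           - \mu\eta\,\bigl(\nabla f(y_t) - \nabla f(y_{t-1})\bigr), \\
\nabla f(y_t) - \nabla f(y_{t-1}) &\approx \nabla^2 f(y_{t-1})\,(y_t - y_{t-1}).
\end{align*}
```

Under these assumptions, the last term of the update is a finite-difference estimate of a Hessian-vector product, which is one way to read the statement that NAG uses second-order information. Classical heavy-ball momentum has the same first three terms but lacks this gradient-difference correction.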

Taxonomy

Topics: Blind Source Separation Techniques · Machine Learning and Algorithms · Machine Learning and ELM

Methods: AdaSqrt · Nesterov Accelerated Gradient · AdaGrad · Adam · Stochastic Gradient Descent