Second-order Information in First-order Optimization Methods
Yuzheng Hu, Licong Lin, Shange Tang

TL;DR
This paper reveals the second-order nature of first-order optimization methods, introduces a new algorithm, AdaSqrt, that challenges the necessity of the square root in adaptive methods, and reports competitive performance on MNIST and CIFAR-10.
Contribution
It uncovers the second-order essence of first-order methods, relates adaptive methods to Natural Gradient Descent (NGD), and proposes AdaSqrt, a novel algorithm that removes the square root from the denominator of the adaptive update.
Findings
AdaSqrt performs comparably to SGD and Adam on MNIST.
AdaSqrt outperforms Adam on CIFAR-10.
Removing the square root does not impair training performance.
Abstract
In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients, and thus approximates the Hessian and accelerates training. For adaptive methods, we relate Adam and Adagrad to a powerful technique in computational statistics: Natural Gradient Descent (NGD). These adaptive methods can in fact be treated as relaxations of NGD, with only a slight difference lying in the square root of the denominator in the update rules. Skeptical about the effect of this difference, we design a new algorithm, AdaSqrt, which removes the square root in the denominator and scales the learning rate by sqrt(T). Surprisingly, our new algorithm is comparable to various first-order methods (such as SGD and Adam) on MNIST and even beats Adam on…
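The abstract describes AdaSqrt only in words, so the following is a minimal sketch of that description rather than the authors' implementation. It contrasts a standard Adagrad step with a hypothetical square-root-free step whose learning rate is multiplied by sqrt(t). The function names, the per-coordinate accumulator, the default `lr` and `eps`, and the toy quadratic are all assumptions made for illustration.

```python
import numpy as np


def adagrad_step(theta, grad, accum, t, lr=0.1, eps=1e-8):
    """Standard Adagrad: divide by the square root of the summed squared gradients.

    `t` is unused here; it is accepted only so both steps share one signature.
    """
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum


def sqrt_free_step(theta, grad, accum, t, lr=0.1, eps=1e-8):
    """Assumed AdaSqrt-style step: no square root, learning rate scaled by sqrt(t)."""
    accum = accum + grad ** 2
    theta = theta - lr * np.sqrt(t) * grad / (accum + eps)
    return theta, accum


def run(step, steps=50):
    """Apply `step` to the toy loss f(theta) = 0.5 * sum(scales * theta**2)."""
    scales = np.array([1.0, 10.0])   # hypothetical, poorly conditioned quadratic
    theta = np.array([1.0, 1.0])
    accum = np.zeros_like(theta)
    for t in range(1, steps + 1):    # t starts at 1 so sqrt(t) is never zero
        grad = scales * theta        # exact gradient of the toy loss
        theta, accum = step(theta, grad, accum, t)
    return theta
```

Calling `run(adagrad_step)` and `run(sqrt_free_step)` runs both rules on the same toy problem. The only differences between the two steps are the missing `np.sqrt` in the denominator and the extra `np.sqrt(t)` factor, which is the change the abstract attributes to AdaSqrt.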
Taxonomy
Topics: Blind Source Separation Techniques · Machine Learning and Algorithms · Machine Learning and ELM
Methods: AdaSqrt · Nesterov Accelerated Gradient · AdaGrad · Adam · Stochastic Gradient Descent
