The Marginal Value of Adaptive Gradient Methods in Machine Learning

Ashia C. Wilson; Rebecca Roelofs; Mitchell Stern; Nathan Srebro; Benjamin Recht

arXiv:1705.08292·stat.ML·May 23, 2018·552 cites

TL;DR

This paper investigates the differences between adaptive gradient methods and traditional gradient descent in training neural networks, revealing that adaptive methods often find solutions with worse generalization despite good training performance.

Contribution

The paper provides theoretical and empirical evidence that adaptive methods can find solutions that generalize worse than those found by SGD, challenging their widespread use in deep learning.

Findings

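01. On simple overparameterized problems, adaptive methods often find solutions with worse test error than GD/SGD.

02. In a constructed, linearly separable binary classification problem, GD and SGD achieve zero test error, while AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half.

03. On several state-of-the-art deep learning models, the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD.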

Abstract

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have…
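The claim that adaptive methods and GD can reach different solutions on an overparameterized problem can be illustrated with a small NumPy sketch. This is not the paper's constructed example: the random data, the problem sizes, the step sizes, and the helper names (`run_gd`, `run_adagrad`) are all assumptions chosen for illustration. It fits a least-squares problem with more parameters than samples, where many weight vectors interpolate the training data, and compares full-batch GD with a basic diagonal AdaGrad update, both started from zero.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): 10 samples, 50 parameters,
# so infinitely many weight vectors fit the training data exactly.
rng = np.random.default_rng(0)
n, d = 10, 50
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))  # arbitrary +/-1 labels


def loss(w):
    """Mean squared error (with a 1/2 factor) on the training set."""
    return 0.5 * np.mean((X @ w - y) ** 2)


def grad(w):
    """Gradient of `loss` with respect to w."""
    return X.T @ (X @ w - y) / n


def run_gd(steps=20000):
    """Full-batch gradient descent from w = 0 with step size 1/L."""
    lr = n / np.linalg.norm(X, 2) ** 2  # L = largest eigenvalue of X^T X / n
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_adagrad(steps=20000, lr=0.01, eps=1e-12):
    """Basic diagonal AdaGrad from w = 0 (assumed variant, full-batch)."""
    w = np.zeros(d)
    G = np.zeros(d)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(w)
        G += g * g
        w -= lr * g / (np.sqrt(G) + eps)
    return w


w_gd = run_gd()
w_ada = run_adagrad()
w_min_norm = np.linalg.pinv(X) @ y  # minimum Euclidean norm interpolator

print("train loss  GD:", loss(w_gd), " AdaGrad:", loss(w_ada))
print("norm        GD:", np.linalg.norm(w_gd), " AdaGrad:", np.linalg.norm(w_ada))
print("distance GD to min-norm solution:", np.linalg.norm(w_gd - w_min_norm))
print("distance AdaGrad to GD solution: ", np.linalg.norm(w_ada - w_gd))
```

Under these assumptions, GD started from zero stays in the row space of `X` and so approaches the minimum-norm interpolating solution, while the per-coordinate rescaling in AdaGrad moves the iterates out of that subspace toward a different interpolator. The sketch only shows that the two solutions differ; it says nothing by itself about test error, which in the paper depends on the specific data construction.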

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: Adam · AdaGrad · RMSProp · Stochastic Gradient Descent