The Marginal Value of Adaptive Gradient Methods in Machine Learning
Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht

TL;DR
This paper investigates the differences between adaptive gradient methods and traditional gradient descent in training neural networks, revealing that adaptive methods often find solutions with worse generalization despite good training performance.
Contribution
It provides theoretical and empirical evidence that adaptive methods can lead to solutions with poorer generalization compared to SGD, challenging their widespread use in deep learning.
Findings
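On simple overparameterized problems, adaptive methods often find drastically different solutions than GD or SGD.
In a constructed, linearly separable binary classification problem, GD and SGD achieve zero test error, while AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to one half.
On several state-of-the-art deep learning models, the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD.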
Abstract
Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have…
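The claim that adaptive methods and GD can reach different solutions on an overparameterized problem can be illustrated with a small sketch. The code below is not the paper's construction or its experiments. It assumes a hypothetical random least-squares problem with more parameters than examples, and the function names, step sizes, and problem sizes are illustrative choices. Both methods start at zero and fit the training data, but plain GD stays in the row space of the data matrix and lands on the minimum-norm interpolating solution, while a basic AdaGrad update rescales each coordinate by its gradient history and ends up elsewhere.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, steps=20000):
    """Plain GD on the mean squared error, started at zero."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def adagrad(X, y, lr=0.01, steps=20000, eps=1e-8):
    """Basic AdaGrad: each coordinate is scaled by the root of its
    accumulated squared gradients (illustrative hyperparameters)."""
    w = np.zeros(X.shape[1])
    accum = np.zeros_like(w)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        accum += grad ** 2
        w -= lr * grad / (np.sqrt(accum) + eps)
    return w

# Hypothetical toy data: 10 examples, 50 parameters, labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 10, 50
X = rng.standard_normal((n, d))
y = np.where(rng.standard_normal(n) >= 0, 1.0, -1.0)

w_gd = gradient_descent(X, y)
w_ada = adagrad(X, y)
w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution

print("GD distance to min-norm solution:", np.linalg.norm(w_gd - w_min_norm))
print("AdaGrad max training residual:   ", np.max(np.abs(X @ w_ada - y)))
print("norm of GD solution:             ", np.linalg.norm(w_gd))
print("norm of AdaGrad solution:        ", np.linalg.norm(w_ada))
```

This sketch only shows that the two optimizers pick different interpolating solutions. It says nothing on its own about test error, which is what the paper's constructed example and experiments address.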
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
Methods: Adam · AdaGrad · RMSProp · Stochastic Gradient Descent
