The Rich and the Simple: On the Implicit Bias of Adam and SGD
Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, Mahdi Soltanolkotabi

TL;DR
This paper compares the implicit biases of Adam and SGD in training neural networks. It shows that Adam is more resistant to simplicity bias and tends to find richer, more complex solutions that lie closer to the Bayes-optimal predictor, which leads to better generalization, especially under distribution shifts.
Contribution
It provides a theoretical and empirical analysis demonstrating Adam's resistance to simplicity bias compared to SGD, resulting in richer features and improved generalization.
Findings
Adam produces more complex decision boundaries than SGD.
Adam achieves higher test accuracy than SGD under distribution shifts.
Theoretical analysis confirms differences in implicit bias between Adam and SGD.
Abstract
Adam is the de facto optimization algorithm for several deep learning applications, but an understanding of its implicit bias and how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks (NNs) trained with SGD are known to exhibit simplicity bias -- a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. First, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU NNs on a binary classification task with Gaussian data. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam leads to much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes' optimal predictor. This richer…
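To make the contrast between the two optimizers concrete, here is a minimal sketch of their update rules in NumPy. It is not the paper's experimental setup: the toy gradient, the learning rate, and the function names `gd_step` and `adam_step` are illustrative assumptions, and the Adam rule is the standard one with bias correction. The sketch only shows one well-known mechanical difference, namely that Adam rescales each coordinate by a running estimate of its gradient magnitude, while GD steps in proportion to the raw gradient.

```python
import numpy as np

def gd_step(w, grad, lr=0.1):
    """Plain gradient descent: the step is proportional to the raw gradient."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update with bias correction (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical gradient whose two coordinates differ in scale by 1000x.
grad = np.array([10.0, -0.01])
w0 = np.zeros(2)

w_gd = gd_step(w0, grad)
w_adam, m, v = adam_step(w0, grad, np.zeros(2), np.zeros(2), t=1)

# GD moves almost entirely along the large-gradient coordinate, while
# Adam's first step has magnitude close to lr in every coordinate.
print("GD:  ", w_gd)
print("Adam:", w_adam)
```

On its first step Adam behaves approximately like a sign-based update, so small-gradient coordinates are not left behind. Whether and how this connects to the richer features described in the abstract is the subject of the paper's analysis; the sketch does not attempt to reproduce it.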
Taxonomy
Topics: Philosophy and Theoretical Science
Methods: Stochastic Gradient Descent · Adam
