AdaSGD: Bridging the gap between SGD and Adam
Jiaxuan Wang, Jenna Wiens

TL;DR
This paper introduces AdaSGD, an adaptive learning rate method that bridges the performance gap between SGD and Adam, providing a unified approach that improves convergence and generalization across various tasks.
Contribution
The paper proposes AdaSGD, a novel optimization algorithm that adapts a global learning rate to combine the strengths of SGD and Adam, supported by theoretical and empirical analysis.
Findings
AdaSGD outperforms standard SGD and Adam in multiple benchmarks.
Theoretical insights explain when and why Adam or SGD perform better.
Empirical results show AdaSGD eliminates the need for transitioning between optimizers.
Abstract
In the context of stochastic gradient descent (SGD) and adaptive moment estimation (Adam), researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving both convergence and generalization performance. However, precisely how each approach trades off early progress and generalization is not well understood; thus, it is unclear when, or even if, one should transition from one approach to the other. In this work, by first studying the convex setting, we identify potential contributors to observed differences in performance between SGD and Adam. In particular, we provide theoretical insights for when and why Adam outperforms SGD and vice versa. We address the performance gap by adapting a single global learning rate for SGD, which we refer to as AdaSGD. We justify this proposed approach with empirical analyses in non-convex settings. On…
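To make the phrase "adapting a single global learning rate for SGD" concrete, here is a minimal sketch contrasting a per-coordinate Adam step with a step that shares one scalar scale across all coordinates. The abstract does not give the AdaSGD update rule, so `global_lr_sgd_step` is a hypothetical illustration: the use of the mean squared gradient as the shared statistic, the omission of momentum, and all hyperparameter names and defaults are assumptions, not the authors' algorithm.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam: one second-moment estimate per coordinate."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def global_lr_sgd_step(theta, grad, state, lr=1e-3, beta2=0.999, eps=1e-8):
    """Hypothetical sketch: SGD with one scalar scale shared by all coordinates.

    Assumption: the shared statistic is a running average of the mean
    squared gradient; the source does not specify this.
    """
    state["t"] += 1
    state["v"] = beta2 * state["v"] + (1 - beta2) * np.mean(grad ** 2)
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias correction
    return theta - lr * grad / (np.sqrt(v_hat) + eps)

# Toy comparison on a single gradient (values chosen for illustration only).
theta = np.array([1.0, 1.0])
grad = np.array([3.0, 4.0])
adam_state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
glob_state = {"t": 0, "v": 0.0}

adam_delta = adam_step(theta, grad, adam_state) - theta
glob_delta = global_lr_sgd_step(theta, grad, glob_state) - theta
```

In this sketch the per-coordinate scaling in `adam_step` changes the direction of the update (on the first step every coordinate moves by roughly `lr`), whereas the scalar scaling in `global_lr_sgd_step` only changes the step length, so the update stays parallel to the gradient as in plain SGD.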
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
Methods: Stochastic Gradient Descent · Adam
