Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization
Aaron Defazio, Samy Jelassi

TL;DR
MADGRAD is an adaptive gradient optimization method from the AdaGrad family that matches or outperforms SGD and ADAM in test set performance across vision and NLP deep learning tasks.
Contribution
The paper introduces MADGRAD, a momentumized, adaptive, dual averaged gradient method for stochastic optimization in deep learning.
Findings
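MADGRAD matches or outperforms both SGD and ADAM in test set performance on each task evaluated.
The tasks span vision (classification and image-to-image) and NLP (recurrent and bidirectionally-masked models).
The results hold even on problems for which adaptive methods normally perform poorly.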
Abstract
We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.
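The abstract names the ingredients (momentum, AdaGrad-style adaptivity, dual averaging) but does not spell out the update rule. As a rough illustration only, the sketch below follows the update as it is commonly described for MADGRAD: gradients and squared gradients are accumulated with weights that grow like the square root of the step count, a dual-averaging point is formed relative to the starting point using a cube-root denominator, and the iterate moves toward that point by a momentum-style average. The function name `madgrad_sketch`, the default hyperparameters, and the toy quadratic objective are assumptions made for this example; this is not the authors' reference implementation, and details such as weight decay are omitted.

```python
import math


def madgrad_sketch(grad, x0, lr=0.01, momentum=0.9, eps=1e-6, steps=500):
    """Illustrative MADGRAD-style loop on a list of floats (hypothetical helper).

    grad: function mapping a list of floats to its gradient (list of floats).
    x0:   starting point; the dual-averaging step is anchored here.
    """
    c = 1.0 - momentum            # weight given to the dual-averaging point
    x = list(x0)
    s = [0.0] * len(x)            # weighted sum of gradients
    v = [0.0] * len(x)            # weighted sum of squared gradients
    for k in range(steps):
        g = grad(x)
        lam = lr * math.sqrt(k + 1)   # step weight grows like sqrt(k + 1)
        for i in range(len(x)):
            s[i] += lam * g[i]
            v[i] += lam * g[i] * g[i]
            # Dual-averaging point: start from x0, scale by cube root of v.
            z = x0[i] - s[i] / (v[i] ** (1.0 / 3.0) + eps)
            # Momentum as an average of the current iterate and z.
            x[i] = (1.0 - c) * x[i] + c * z
    return x


# Toy problem (assumed for illustration): minimize
# 0.5 * (a - 3)^2 + 0.5 * (b + 1)^2, whose minimizer is (3, -1).
if __name__ == "__main__":
    result = madgrad_sketch(lambda p: [p[0] - 3.0, p[1] + 1.0], [0.0, 0.0])
    print(result)
```

The cube root (rather than AdaGrad's square root) and the anchoring at `x0` are the two places where this sketch departs most visibly from a plain AdaGrad loop; the hyperparameter values here were chosen for the toy problem and say nothing about sensible settings for real models.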
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Momentumized, adaptive, dual averaged gradient · Stochastic Gradient Descent · Adam · AdaGrad
