An overview of gradient descent optimization algorithms
Sebastian Ruder

TL;DR
This paper provides an intuitive overview of various gradient descent optimization algorithms, discussing their behaviors, challenges, and strategies for effective use in different settings.
Contribution
It offers a comprehensive summary of gradient descent variants, challenges, and optimization strategies, aiding practitioners in understanding their practical strengths and weaknesses.
Findings
Different variants of gradient descent are compared and explained.
Challenges in applying gradient descent are summarized.
Architectures for running gradient descent in parallel and distributed settings are reviewed, along with additional strategies for optimizing it.
Abstract
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
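The "different variants of gradient descent" mentioned in the abstract are usually distinguished by how many training examples feed each parameter update: the whole dataset (batch), a single example (stochastic), or a small subset (mini-batch). The following is a minimal sketch of that distinction, not code from the paper. The toy 1-D least-squares problem, the function names `gradient` and `gradient_descent`, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import random


def gradient(theta, batch):
    """Gradient of the mean squared error for the 1-D model y ~ theta * x."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)


def gradient_descent(data, batch_size, lr=0.1, epochs=200, seed=0):
    """Run gradient descent with one parameter update per batch.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent (SGD)
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    theta = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle the examples at the start of every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            theta -= lr * gradient(theta, batch)
    return theta


# Hypothetical toy data drawn from y = 3x with no noise,
# so every variant shares the same optimum theta = 3.
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

theta_batch = gradient_descent(data, batch_size=len(data))
theta_sgd = gradient_descent(data, batch_size=1)
theta_mini = gradient_descent(data, batch_size=4)
```

On this noise-free problem all three runs should settle near the same value. With noisy data the single-example updates would fluctuate more from step to step, which is the kind of trade-off the overview discusses.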
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Medical Image Segmentation Techniques · Face and Expression Recognition
