Optimization for deep learning: theory and algorithms
Ruoyu Sun

TL;DR
This paper reviews optimization algorithms and theory for training neural networks. It covers gradient explosion/vanishing, initialization and normalization, generic methods such as SGD and adaptive gradient methods, and the global loss landscape, together with practical solutions and recent research findings.
Contribution
It provides a comprehensive overview of optimization techniques, theory, and global issues in neural network training, integrating recent advances and practical methods.
Findings
Gradient explosion/vanishing issues and solutions
Theoretical analysis of optimization algorithms like SGD and adaptive methods
Research insights on local minima, mode connectivity, and infinite-width networks
Abstract
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.
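The link between initialization scale and gradient explosion/vanishing mentioned in the abstract can be illustrated with a small NumPy experiment. This is a minimal sketch and is not taken from the paper: the function name `input_gradient_norm`, the fully connected ReLU architecture, the width and depth, and the use of the He-style standard deviation sqrt(2 / width) as the "careful" baseline are all assumptions chosen for illustration.

```python
import numpy as np

def input_gradient_norm(depth, width, weight_std, seed=0):
    """Norm of the gradient w.r.t. the input of a random ReLU network.

    Hypothetical helper for illustration: weights are i.i.d. Gaussian
    with standard deviation `weight_std`, and there are no biases.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    layers = []
    # Forward pass: remember each weight matrix and its ReLU mask.
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        z = W @ h
        mask = (z > 0).astype(z.dtype)
        h = z * mask
        layers.append((W, mask))
    # Backward pass: push a unit-norm upstream gradient back to the input.
    g = np.ones(width) / np.sqrt(width)
    for W, mask in reversed(layers):
        g = W.T @ (mask * g)
    return np.linalg.norm(g)

width, depth = 256, 50
he_std = np.sqrt(2.0 / width)  # He-style scaling (assumed baseline)

small = input_gradient_norm(depth, width, 0.5 * he_std)  # weights too small
he = input_gradient_norm(depth, width, he_std)           # balanced scale
large = input_gradient_norm(depth, width, 2.0 * he_std)  # weights too large

print(f"0.5x scale: {small:.3e}")
print(f"He scale:   {he:.3e}")
print(f"2x scale:   {large:.3e}")
```

Under these assumptions, halving or doubling the weight scale changes the gradient norm by a factor of roughly 0.5 or 2 per layer, so over 50 layers the gradient either vanishes or explodes by many orders of magnitude, while the balanced scale keeps it at a moderate size.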
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Neural Networks and Applications
Methods: Stochastic Gradient Descent
