Optimization for deep learning: theory and algorithms

Ruoyu Sun

arXiv:1912.08957 · cs.LG · December 21, 2019 · 132 cites

TL;DR

This paper reviews optimization algorithms and theory for training neural networks. It covers gradient explosion/vanishing, careful initialization and normalization methods, generic optimization methods such as SGD, and the global landscape of neural network training.

Contribution

The paper provides an overview of optimization algorithms and theory for neural network training, covering practical methods as well as global issues such as bad local minima, mode connectivity, the lottery ticket hypothesis and infinite-width analysis.

Findings

01. Gradient explosion/vanishing issues and solutions

02. Theoretical analysis of optimization algorithms like SGD and adaptive methods

03. Research insights on local minima, mode connectivity, and infinite-width networks

Abstract

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.
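
The gradient explosion/vanishing issue mentioned in the abstract can be illustrated with a small numerical experiment. The sketch below is not taken from the paper. It assumes a deep linear network whose layers are i.i.d. Gaussian matrices with entry standard deviation `scale / sqrt(width)`, and it tracks the norm of a unit vector pushed through `depth` such layers, which is how a backpropagated gradient behaves in that simplified setting. The function name `backprop_norm` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def backprop_norm(depth, width, scale, seed=0):
    """Norm of a unit vector after `depth` random Gaussian layers.

    Assumption for illustration: each layer is a fresh width x width
    matrix with i.i.d. N(0, (scale / sqrt(width))^2) entries, as in a
    deep linear network at initialization.
    """
    rng = random.Random(seed)
    std = scale / math.sqrt(width)
    # Start from a unit-norm vector standing in for the output gradient.
    g = [1.0 / math.sqrt(width)] * width
    for _ in range(depth):
        # One layer: g <- W g, sampling the entries of W on the fly.
        g = [sum(rng.gauss(0.0, std) * gj for gj in g) for _ in range(width)]
    return math.sqrt(sum(v * v for v in g))


if __name__ == "__main__":
    for scale in (0.5, 1.0, 2.0):
        norm = backprop_norm(depth=30, width=64, scale=scale)
        print(f"scale={scale}: norm after 30 layers = {norm:.3e}")
```

With `scale` below 1 the norm shrinks roughly geometrically with depth (vanishing), with `scale` above 1 it grows roughly geometrically (explosion), and `scale = 1` keeps it on the order of 1. This is the intuition behind choosing the initialization variance in proportion to `1 / width`.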

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Neural Networks and Applications

Methods: Stochastic Gradient Descent