On the Convergence of SGD Training of Neural Networks

Thomas M. Breuel

arXiv:1508.02790·cs.NE·August 13, 2015·1 cites

Open Access

TL;DR

This paper empirically examines how SGD converges when training neural networks. Its results suggest that local minima and similar features of the objective function matter little, and that training is better described as many largely independent subproblems converging simultaneously at different rates.

Contribution

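It challenges the usual motivation for learning rate schedules, momentum, and batching by presenting empirical evidence that local minima, valleys, and related phenomena are not significant factors in SGD optimization of MLP-related objective functions. It instead describes SGD as the simultaneous convergence of many largely non-interacting subproblems.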

Findings

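01 SGD behavior is better described as the simultaneous convergence of many largely non-interacting subproblems

02 Local minima, valleys, and similar phenomena are not significant factors in SGD optimization of MLP-related objective functions

03 The subproblems converge at different rates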

Abstract

Neural networks are usually trained by some form of stochastic gradient descent (SGD). A number of strategies are in common use intended to improve SGD optimization, such as learning rate schedules, momentum, and batching. These are motivated by ideas about the occurrence of local minima at different scales, valleys, and other phenomena in the objective function. Empirical results presented here suggest that these phenomena are not significant factors in SGD optimization of MLP-related objective functions, and that the behavior of stochastic gradient descent in these problems is better described as the simultaneous convergence at different rates of many, largely non-interacting subproblems.
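
The picture in the abstract, many non-interacting subproblems converging at different rates, can be illustrated with a toy objective. The sketch below is not the paper's experiment: it assumes a separable quadratic f(w) = 0.5 * sum_i c_i * w_i^2 with Gaussian gradient noise, so that each coordinate is its own subproblem and its curvature c_i sets its convergence rate. The function names, the curvature values, and the 0.1 tolerance are all hypothetical choices made for illustration.

```python
import random


def sgd_separable(curvatures, steps=200, lr=0.1, momentum=0.0, noise=0.01, seed=0):
    """Run noisy gradient descent on f(w) = 0.5 * sum_i c_i * w_i**2.

    Each coordinate is an independent subproblem (an assumption of this toy
    model, not a claim about real networks). Returns |w_i| after every step.
    """
    rng = random.Random(seed)
    w = [1.0] * len(curvatures)   # every subproblem starts at distance 1
    v = [0.0] * len(curvatures)   # momentum buffer (unused when momentum=0)
    history = []
    for _ in range(steps):
        for i, c in enumerate(curvatures):
            grad = c * w[i] + rng.gauss(0.0, noise)  # noisy gradient estimate
            v[i] = momentum * v[i] - lr * grad       # classical momentum update
            w[i] += v[i]
        history.append([abs(x) for x in w])
    return history


def steps_to_reach(history, index, tol=0.1):
    """Number of updates until coordinate `index` first has |w| < tol, or None."""
    for t, row in enumerate(history):
        if row[index] < tol:
            return t + 1
    return None


if __name__ == "__main__":
    curvatures = [1.0, 0.1, 0.01]  # hypothetical fast, medium, slow subproblems
    hist = sgd_separable(curvatures, steps=3000)
    for i, c in enumerate(curvatures):
        print(f"curvature {c}: reached tolerance after {steps_to_reach(hist, i)} steps")
```

In this toy setting the aggregate loss curve is a sum of exponentials with very different time constants, so a long slow tail appears without any local minima or valleys being involved. Whether real MLP objectives decompose this cleanly is the empirical question the paper addresses.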

Taxonomy

Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM

Methods: Stochastic Gradient Descent