On the Convergence of SGD Training of Neural Networks
Thomas M. Breuel

TL;DR
This paper investigates the convergence behavior of SGD in neural network training. Its empirical results suggest that phenomena such as local minima matter less than the simultaneous convergence, at different rates, of many largely independent subproblems.
Contribution
It challenges the traditional view that local minima, valleys, and similar features of the objective function drive SGD behavior, and instead describes SGD as converging simultaneously on many largely non-interacting subproblems.
Findings
SGD behavior is better described as the simultaneous convergence of many largely non-interacting subproblems
Local minima do not appear to be a significant factor in SGD optimization of MLP-related objective functions
The subproblems converge at different rates
Abstract
Neural networks are usually trained by some form of stochastic gradient descent (SGD). A number of strategies intended to improve SGD optimization are in common use, such as learning rate schedules, momentum, and batching. These are motivated by ideas about the occurrence of local minima at different scales, valleys, and other phenomena in the objective function. Empirical results presented here suggest that these phenomena are not significant factors in SGD optimization of MLP-related objective functions, and that the behavior of stochastic gradient descent in these problems is better described as the simultaneous convergence at different rates of many, largely non-interacting subproblems.
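The abstract's closing description can be made concrete with a toy sketch. This is not an experiment from the paper: it assumes a hypothetical separable quadratic objective, f(w) = sum_i 0.5 * c_i * w_i^2, chosen only because its coordinates are non-interacting by construction. The function names, curvature values, learning rate, and tolerance are all illustrative assumptions.

```python
import random


def sgd_separable(curvatures, lr=0.1, steps=200, noise=0.0, seed=0):
    """Run SGD on f(w) = sum_i 0.5 * c_i * w_i**2, starting from w_i = 1.

    Returns the list of weight vectors visited, including the start.
    `noise` scales Gaussian gradient noise (0.0 gives plain gradient descent).
    """
    rng = random.Random(seed)
    w = [1.0 for _ in curvatures]
    history = [list(w)]
    for _ in range(steps):
        # Each coordinate's gradient depends only on that coordinate,
        # so the subproblems do not interact.
        grads = [c * wi + noise * rng.gauss(0.0, 1.0)
                 for c, wi in zip(curvatures, w)]
        w = [wi - lr * g for wi, g in zip(w, grads)]
        history.append(list(w))
    return history


def steps_to_tolerance(history, index, tol=1e-3):
    """First step at which |w[index]| drops below tol, or None if it never does."""
    for t, w in enumerate(history):
        if abs(w[index]) < tol:
            return t
    return None


# Three hypothetical subproblems with high, medium, and low curvature.
curvatures = [5.0, 1.0, 0.2]
history = sgd_separable(curvatures, lr=0.1, steps=500)
hits = [steps_to_tolerance(history, i) for i in range(len(curvatures))]
print(hits)  # one convergence time per subproblem
```

In this sketch the high-curvature coordinate reaches the tolerance first and the low-curvature coordinate last, so the total loss curve is a sum of components decaying at different rates. Whether real MLP objectives decompose this way is the paper's empirical claim, not something this toy demonstrates.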
Taxonomy
Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
Methods: Stochastic Gradient Descent
