Non-convergence of stochastic gradient descent in the training of deep neural networks

Patrick Cheridito; Arnulf Jentzen; Florian Rossmannek

arXiv:2006.07075·cs.LG·October 13, 2021


TL;DR

This paper shows that stochastic gradient descent can fail to converge in the training of ReLU networks when their depth is much larger than their width and the number of random initializations does not grow to infinity fast enough.

Contribution

The paper gives a rigorous proof that stochastic gradient descent fails to converge for ReLU networks whose depth is much larger than their width, unless the number of randomly initialized gradient trajectories increases to infinity fast enough.

Findings

01

SGD fails to converge for ReLU networks whose depth is much larger than their width when too few random initializations are used.

02

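The approximation error can be shown to converge to zero only if architecture, amount of training data, number of gradient steps, and number of random initializations are all sent to infinity in the right order; in particular, the number of initializations must grow fast enough.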

03

The result gives a theoretical reason why training very deep, narrow ReLU networks from only a few random initializations can be problematic.

Abstract

Deep neural networks have successfully been trained in various application areas with stochastic gradient descent. However, there exists no rigorous mathematical explanation why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.
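
To give a feel for why depth much larger than width can be a problem, the following is a minimal sketch and not the paper's proof or method. It assumes a setup that the abstract does not specify: fully connected ReLU layers with i.i.d. standard normal weights, zero biases, and a small set of hypothetical probe inputs. It estimates how often a randomly initialized network maps every probe input to zero in its last hidden layer, in which case the network is constant on those inputs. All function names and parameters are illustrative.

```python
import random

# Hypothetical probe inputs (an assumption); any small set of nonzero points would do.
PROBE_INPUTS = [[-1.0], [0.5], [1.0]]


def sample_network(depth, width, in_dim, rng):
    """Draw `depth` hidden-layer weight matrices with i.i.d. N(0, 1) entries.

    Biases are fixed to zero, which is an assumption made for this sketch.
    """
    sizes = [in_dim] + [width] * depth
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(sizes[k])] for _ in range(sizes[k + 1])]
        for k in range(depth)
    ]


def hidden_output(weights, x):
    """Propagate input x through all ReLU layers and return the last activations."""
    a = list(x)
    for layer in weights:
        a = [max(0.0, sum(w * v for w, v in zip(row, a))) for row in layer]
    return a


def is_dead(weights, inputs):
    """True if the last hidden layer is exactly zero for every given input."""
    return all(all(v == 0.0 for v in hidden_output(weights, x)) for x in inputs)


def dead_fraction(depth, width, trials=200, seed=0):
    """Fraction of random initializations that are dead on the probe inputs."""
    rng = random.Random(seed)
    dead = sum(
        is_dead(sample_network(depth, width, 1, rng), PROBE_INPUTS)
        for _ in range(trials)
    )
    return dead / trials


if __name__ == "__main__":
    for depth in (1, 5, 20, 60):
        print(depth, dead_fraction(depth, width=2))
```

Under these assumptions, a layer of width w zeroes out a given nonzero activation vector with probability 2^(-w), so the chance that a single input survives d layers is (1 - 2^(-w))^d, which tends to zero as d grows with w fixed. Drawing more random initializations raises the chance that at least one of them is not dead, which is in the spirit of the abstract's condition on the number of initializations.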
