Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Yuanzhi Li; Yang Yuan

arXiv:1705.09886 · cs.LG · November 3, 2017 · 275 cites

TL;DR

This paper analyzes the convergence of SGD on a subset of two-layer ReLU neural networks with an identity mapping structure, proving that SGD reaches the global minimum in a polynomial number of steps when the input is Gaussian and the weights use standard initialization.

Contribution

It introduces a novel convergence analysis for two-layer ReLU networks with identity mapping, demonstrating global convergence under Gaussian input and standard initialization.

Findings

01

SGD converges to the global minimum in a polynomial number of steps for networks with identity mapping.

02

The identity mapping makes the network asymmetric, so the global minimum is unique; experiments also indicate better performance than vanilla networks.

03

Convergence occurs in two phases: in the first, the gradient can point in the wrong direction; in the second, SGD enters a one-point convex region and converges.

Abstract

In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution, with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike normal vanilla networks, the "identity mapping" makes our network asymmetric and thus the global minimum is unique. To complement our theory, we are also able to show experimentally that multi-layer…
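The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes the network has the form $f(x, W) = \|\mathrm{ReLU}((I + W)^\top x)\|_1$, which this summary does not state explicitly, and it assumes a teacher-student squared loss against a fixed target `W_star`. The dimension, the step size, the constant `scale` in front of $1/\sqrt{d}$, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def forward(W, x):
    """Assumed model: f(x, W) = sum_i ReLU(<e_i + w_i, x>), w_i = i-th column of W."""
    # (I + W)^T x equals x + W^T x, so the identity matrix is never built.
    return relu(x + W.T @ x).sum()


def sgd_step(W, W_star, x, lr):
    """One SGD step on the per-sample loss (f(x, W) - f(x, W_star))^2."""
    pre = x + W.T @ x                       # pre-activations, shape (d,)
    err = relu(pre).sum() - forward(W_star, x)
    active = (pre > 0).astype(W.dtype)      # ReLU derivative per hidden unit
    # grad[j, i] = 2 * err * x_j * 1[pre_i > 0]
    grad = 2.0 * err * np.outer(x, active)
    return W - lr * grad


def run_sgd(d=8, steps=10_000, lr=2e-3, scale=0.1, seed=0):
    """Run SGD with Gaussian inputs; return (initial, final) distance to W_star."""
    rng = np.random.default_rng(seed)
    # Entries of order 1/sqrt(d); the small constant `scale` is an assumption
    # chosen so that both weight matrices have small spectral norm.
    W_star = rng.normal(scale=scale / np.sqrt(d), size=(d, d))
    W = rng.normal(scale=scale / np.sqrt(d), size=(d, d))
    dist0 = np.linalg.norm(W - W_star)
    for _ in range(steps):
        x = rng.standard_normal(d)          # fresh Gaussian input each step
        W = sgd_step(W, W_star, x, lr)
    return dist0, np.linalg.norm(W - W_star)


if __name__ == "__main__":
    dist0, dist1 = run_sgd()
    print(f"||W - W*||_F: start {dist0:.4f}, end {dist1:.4f}")
```

Because the target is realizable (the teacher and student share one architecture), the per-sample gradient vanishes at `W_star`, which is why a constant step size suffices in this toy run. The run only shows the iterates approaching the target in a small, low-norm instance; it does not reproduce the paper's two-phase analysis or its guarantees.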

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Advanced Neural Network Applications

Methods: Affine Coupling · Normalizing Flows · Stochastic Gradient Descent