Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

Guodong Zhang; James Martens; Roger Grosse

arXiv:1905.10961·stat.ML·October 29, 2019·41 cites

Open Access

TL;DR

This paper gives the first theoretical analysis of how fast natural gradient descent converges on overparameterized nonlinear neural networks with squared-error loss. Convergence is efficient when the Jacobian has full row rank and stays stable under small perturbations around the initialization, and the analysis extends to approximate methods such as K-FAC.

Contribution

It identifies two conditions on the Jacobian that guarantee fast convergence of natural gradient descent in nonlinear neural networks, and proves that both hold for two-layer ReLU networks under the assumptions of nondegenerate inputs and overparameterization.

Findings

01. Natural gradient descent converges efficiently when the Jacobian has full row rank and is stable under small perturbations around the initialization.

02. For overparameterized two-layer ReLU networks with nondegenerate inputs, both conditions hold throughout training.

03. K-FAC, an approximate natural gradient method, also converges, and the paper proves a rate for it.
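For readers unfamiliar with K-FAC, the sketch below shows the generic Kronecker-factored preconditioning step for a single weight matrix. It is an illustration of the general technique, not the specific variant analyzed in the paper. The function name `kfac_precondition`, the factor names `A` (input second moments) and `S` (output-gradient second moments), and the per-factor damping are all assumptions made for this example.

```python
import numpy as np

def kfac_precondition(grad_W, A, S, damping=1e-3):
    """Apply (A_d kron S_d)^{-1} to grad_W without forming the Kronecker product.

    grad_W: (m, d) gradient of the loss w.r.t. a weight matrix.
    A: (d, d) second-moment matrix of the layer inputs.
    S: (m, m) second-moment matrix of the gradients w.r.t. the layer outputs.
    Damping is added to each factor separately (an assumption of this sketch).
    """
    m, d = grad_W.shape
    A_d = A + damping * np.eye(d)
    S_d = S + damping * np.eye(m)
    left = np.linalg.solve(S_d, grad_W)       # S_d^{-1} grad_W
    return np.linalg.solve(A_d, left.T).T     # ... A_d^{-1} (A_d is symmetric)

# Toy data (hypothetical sizes) to compare against the dense Kronecker solve.
rng = np.random.default_rng(0)
n, d, m = 50, 4, 3
inputs = rng.standard_normal((n, d))
out_grads = rng.standard_normal((n, m))
A = inputs.T @ inputs / n
S = out_grads.T @ out_grads / n
grad_W = rng.standard_normal((m, d))

fast = kfac_precondition(grad_W, A, S)

# Dense reference: (A_d kron S_d) vec(X) = vec(S_d X A_d) with column-major vec.
A_d = A + 1e-3 * np.eye(d)
S_d = S + 1e-3 * np.eye(m)
dense = np.linalg.solve(np.kron(A_d, S_d), grad_W.flatten(order="F"))
dense = dense.reshape((m, d), order="F")
```

The factored solve costs two small linear systems instead of one system of size `m*d`, which is the practical reason for approximating the curvature matrix by a Kronecker product.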

Abstract

Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our…
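The setting in the abstract can be illustrated with a small numerical sketch: a wide two-layer ReLU network, squared-error loss, and a natural gradient step computed from the Jacobian. Several choices here are assumptions for illustration rather than details taken from the page: only the hidden weights are trained, the output weights are fixed random signs, the singular curvature matrix is handled with the pseudo-inverse form θ ← θ − η Jᵀ(JJᵀ)⁻¹(f − y), and the sizes `n`, `d`, `m` and step size are arbitrary.

```python
import numpy as np

def forward(W, a, X):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) for each row x of X."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def jacobian(W, a, X):
    """Jacobian of the n outputs w.r.t. the flattened hidden weights, shape (n, m*d)."""
    n, d = X.shape
    m = W.shape[0]
    active = (X @ W.T > 0).astype(float)            # (n, m) ReLU gates
    J = (active * a)[:, :, None] * X[:, None, :]    # (n, m, d)
    return J.reshape(n, m * d) / np.sqrt(m)

def ngd_step(W, a, X, y, lr):
    """One step of theta <- theta - lr * J^T (J J^T)^{-1} (f - y)."""
    J = jacobian(W, a, X)
    residual = forward(W, a, X) - y
    gram = J @ J.T                                   # (n, n); invertible iff J has full row rank
    delta = J.T @ np.linalg.solve(gram, residual)
    return W - lr * delta.reshape(W.shape)

rng = np.random.default_rng(0)
n, d, m = 4, 20, 4096                                # few samples, many hidden units
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm, non-parallel inputs
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))                      # trained hidden weights
a = rng.choice([-1.0, 1.0], size=m)                  # fixed output weights

row_rank = np.linalg.matrix_rank(jacobian(W, a, X))  # condition (1) at initialization

losses = []
for _ in range(30):
    losses.append(0.5 * np.sum((forward(W, a, X) - y) ** 2))
    W = ngd_step(W, a, X, y, lr=0.5)
losses.append(0.5 * np.sum((forward(W, a, X) - y) ** 2))
```

Working with the n-by-n Gram matrix `J @ J.T` instead of the (m·d)-by-(m·d) curvature matrix keeps the linear solve small when the network has far more parameters than training cases, and the solve is well defined exactly when the Jacobian has full row rank.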

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Sparse and Compressive Sensing Techniques