Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Guodong Zhang, James Martens, Roger Grosse

TL;DR
This paper gives the first convergence-rate analysis of natural gradient descent on nonlinear neural networks with squared-error loss. It shows fast convergence from random initialization when the Jacobian matrix has full row rank and is stable under small perturbations around the initialization, and it extends the analysis to approximate methods such as K-FAC.
Contribution
It identifies two conditions on the Jacobian that guarantee fast convergence of natural gradient descent in nonlinear neural networks, and proves that both hold for two-layer ReLU networks under the assumptions of nondegenerate inputs and overparameterization.
Findings
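Natural gradient descent converges efficiently from random initialization when the Jacobian has full row rank and is stable under small perturbations around the initialization.
For overparameterized two-layer ReLU networks with nondegenerate inputs, both conditions hold throughout training.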
K-FAC, an approximate natural gradient method, also converges with a proven rate.
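The first two findings can be illustrated numerically. The sketch below is not the authors' code. It assumes a two-layer ReLU network with fixed ±1 output weights and 1/sqrt(m) output scaling, a squared-error loss, and the update W ← W − lr · Jᵀ(JJᵀ)⁻¹(f − y), which is well defined only when the Jacobian J has full row rank. The function names (forward, jacobian, ngd_step), the damping parameter, and the sizes n, d, m are illustrative choices, not taken from the paper.

```python
import numpy as np

def forward(W, a, X):
    """Two-layer ReLU net: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def jacobian(W, a, X):
    """Jacobian of the n outputs w.r.t. the flattened first-layer weights.

    Returns an (n, m*d) array whose columns follow W.reshape(-1) ordering.
    """
    n, m = X.shape[0], W.shape[0]
    active = (X @ W.T > 0).astype(X.dtype)           # (n, m) ReLU gates
    J = (active * a)[:, :, None] * X[:, None, :]     # (n, m, d)
    return J.reshape(n, -1) / np.sqrt(m)

def ngd_step(W, a, X, y, lr=1.0, damping=0.0):
    """One step W <- W - lr * J^T (J J^T + damping*I)^{-1} (f - y).

    `damping` is a hypothetical knob for ill-conditioned Gram matrices;
    with damping=0 the step requires J to have full row rank.
    """
    J = jacobian(W, a, X)
    residual = forward(W, a, X) - y
    gram = J @ J.T + damping * np.eye(len(y))        # (n, n) Gram matrix
    step = J.T @ np.linalg.solve(gram, residual)     # (m*d,)
    return W - lr * step.reshape(W.shape)

def loss(W, a, X, y):
    return 0.5 * np.sum((forward(W, a, X) - y) ** 2)

# Assumed toy setup: few unit-norm inputs, many hidden units.
rng = np.random.default_rng(0)
n, d, m = 5, 10, 2000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
W = rng.normal(size=(m, d))              # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m)      # fixed output weights

losses = [loss(W, a, X, y)]
for _ in range(5):
    W = ngd_step(W, a, X, y)
    losses.append(loss(W, a, X, y))
```

In this overparameterized toy setting, the recorded losses should fall sharply within a few full steps (lr=1.0). Passing a small positive damping is one way to keep the linear solve stable if the Gram matrix is close to singular.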
Abstract
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Sparse and Compressive Sensing Techniques
