Revisiting Natural Gradient for Deep Networks
Razvan Pascanu, Yoshua Bengio

TL;DR
This paper reevaluates natural gradient for training deep models. It connects natural gradient to other recent second-order methods, shows how unlabeled data can improve generalization, tests robustness to the ordering of the training set, and extends the method with second-order information.
Contribution
It establishes connections between natural gradient and recent second-order methods, introduces an extension that incorporates second-order information, and benchmarks the extended algorithm.
Findings
Natural gradient is connected to Hessian-Free, Krylov Subspace Descent, and TONGA methods.
Using unlabeled data can enhance generalization in natural gradient training.
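The extended method, which adds second-order information to the manifold information, is benchmarked using a truncated Newton approach to invert the metric matrix rather than a diagonal approximation.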
Abstract
We evaluate natural gradient, an algorithm originally proposed in Amari (1997), for learning deep models. The contributions of this paper are as follows. We show the connection between natural gradient and three other recently proposed methods for training deep models: Hessian-Free (Martens, 2010), Krylov Subspace Descent (Vinyals and Povey, 2012) and TONGA (Le Roux et al., 2008). We describe how one can use unlabeled data to improve the generalization error obtained by natural gradient and empirically evaluate the robustness of the algorithm to the ordering of the training set compared to stochastic gradient descent. Finally we extend natural gradient to incorporate second order information alongside the manifold information and provide a benchmark of the new algorithm using a truncated Newton approach for inverting the metric matrix instead of using a diagonal approximation of it.
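To make the abstract's ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy logistic regression model, for which the Fisher information metric is F = E_x[p(1-p) x x^T] and therefore depends only on the inputs, so unlabeled inputs can contribute to its estimate. The linear system F d = g is solved approximately with conjugate gradient using only matrix-vector products, in the spirit of a truncated Newton approach. All function names, the damping constant, and the toy data are hypothetical choices made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def nll(w, xs, ys):
    """Mean negative log-likelihood of logistic regression."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(dot(w, x))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def gradient(w, xs, ys):
    """Gradient of the mean negative log-likelihood (needs labels)."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sigmoid(dot(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(xs)
    return g

def fisher_vector_product(w, xs, v, damping=1e-3):
    """Return (F + damping * I) v without forming F explicitly.

    F = mean over xs of p(1-p) x x^T; only inputs are used, no labels.
    The damping term is an assumption added for numerical stability.
    """
    out = [damping * vi for vi in v]
    for x in xs:
        p = sigmoid(dot(w, x))
        scale = p * (1.0 - p) * dot(x, v) / len(xs)
        for i, xi in enumerate(x):
            out[i] += scale * xi
    return out

def conjugate_gradient(matvec, b, iters=20, tol=1e-10):
    """Approximately solve A d = b for a symmetric positive definite A."""
    d = [0.0] * len(b)
    r = list(b)          # residual b - A d, with d = 0 initially
    p = list(b)          # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return d

def natural_gradient_step(w, xs, ys, metric_xs, lr=1.0):
    """One update w <- w - lr * F^{-1} g, with F estimated on metric_xs."""
    g = gradient(w, xs, ys)
    d = conjugate_gradient(lambda v: fisher_vector_product(w, metric_xs, v), g)
    return [wi - lr * di for wi, di in zip(w, d)]

# Toy usage: labeled data for the gradient, labeled plus unlabeled inputs for the metric.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 0, 1, 1]
unlabeled_xs = [[1.0, 0.0], [1.0, 1.5]]
w = natural_gradient_step([0.0, 0.0], xs, ys, xs + unlabeled_xs)
```

The gradient is computed from the labeled examples only, while the metric is estimated on a larger pool of inputs. Capping the number of conjugate gradient iterations is what makes the solve "truncated": the metric is never formed or inverted explicitly.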
Taxonomy
Topics: Stochastic Gradient Optimization Techniques
