Parallel training of DNNs with Natural Gradient and Parameter Averaging
Daniel Povey, Xiaohui Zhang, Sanjeev Khudanpur

TL;DR
This paper presents a scalable neural network training framework that combines parameter averaging across multiple machines with an efficient Natural Gradient method, improving convergence and hardware utilization.
Contribution
It introduces a novel combination of parameter averaging and an approximate Natural Gradient method for effective multi-machine DNN training.
Findings
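Periodic parameter averaging (typically every minute or two) enables multi-machine training with little network traffic, but by itself it does not work very well.
The approximate Natural Gradient method (NG-SGD) substantially improves SGD convergence and appears to be what makes periodic averaging work well.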
The combined method performs well in large-scale speech recognition tasks.
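For readers unfamiliar with the idea, the following is a minimal sketch of a textbook natural-gradient step, in which the gradient is preconditioned by the inverse of a (damped) empirical Fisher matrix. It is not the paper's approximate NG-SGD, whose details are not given on this page; the function name, the `damping` term, and the use of per-example gradients are assumptions made for illustration.

```python
import numpy as np

def natural_gradient_step(theta, per_example_grads, lr=0.1, damping=1e-3):
    """One generic natural-gradient update (illustrative, not the paper's NG-SGD).

    theta:             parameter vector, shape (d,)
    per_example_grads: gradients of the loss for each example, shape (n, d)
    """
    n = per_example_grads.shape[0]
    # Ordinary minibatch gradient: mean of the per-example gradients.
    g = per_example_grads.mean(axis=0)
    # Empirical Fisher estimate: average outer product of per-example gradients.
    fisher = per_example_grads.T @ per_example_grads / n
    # Damping keeps the matrix invertible (an assumption of this sketch).
    fisher = fisher + damping * np.eye(theta.size)
    # Precondition the gradient by the inverse Fisher without forming the inverse.
    direction = np.linalg.solve(fisher, g)
    return theta - lr * direction
```

Forming and solving with a full d-by-d matrix is only practical for small d, which is why an efficient approximation matters for DNN-sized models.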
Abstract
We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single…
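The periodic-averaging scheme described in the abstract can be sketched roughly as follows. This is a toy simulation under stated assumptions, not Kaldi's implementation: the workers run sequentially in one process, the model is a linear least-squares fit, and a fixed number of local SGD steps stands in for the "every minute or two" interval. All function names and parameters here are hypothetical.

```python
import numpy as np

def local_sgd(params, X, y, lr, steps, rng):
    """Plain SGD on one worker's data shard (toy linear least-squares model)."""
    for _ in range(steps):
        i = rng.integers(len(X))          # pick one training example
        err = X[i] @ params - y[i]        # prediction error on that example
        params = params - lr * err * X[i] # gradient step for squared error
    return params

def average_parameters(worker_params):
    """Element-wise mean of the parameter vectors from all workers."""
    return np.mean(np.stack(worker_params), axis=0)

def train_with_periodic_averaging(shards, dim, outer_iters=20,
                                  local_steps=50, lr=0.05, seed=0):
    """Alternate local training on each shard with parameter averaging."""
    rng = np.random.default_rng(seed)
    params = np.zeros(dim)
    for _ in range(outer_iters):
        # Every worker starts from the same averaged parameters
        # but sees a different shard of the data.
        worker_params = [
            local_sgd(params.copy(), X, y, lr, local_steps, rng)
            for X, y in shards
        ]
        # Average, then redistribute for the next round.
        params = average_parameters(worker_params)
    return params
```

Only one parameter exchange happens per outer iteration, which is what keeps network traffic low compared with synchronizing after every minibatch.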
Taxonomy
Topics: Neural Networks and Applications · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Stochastic Gradient Descent
