Parallel training of DNNs with Natural Gradient and Parameter Averaging

Daniel Povey; Xiaohui Zhang; Sanjeev Khudanpur

arXiv:1410.7455 · cs.NE · June 24, 2015 · ICLR · 123 cites

TL;DR

This paper describes the neural network training framework used in the Kaldi speech recognition toolkit, which combines periodic parameter averaging across multiple machines with an efficient, approximate Natural Gradient method (NG-SGD) that improves the convergence of SGD.

Contribution

It introduces a novel combination of parameter averaging and an approximate Natural Gradient method for effective multi-machine DNN training.

Findings

01 Parameter averaging enables multi-machine training with minimal network traffic.

02 The approximate Natural Gradient significantly improves convergence.

03 The combined method performs well in large-scale speech recognition tasks.

Abstract

We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single…
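
The periodic-averaging loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the Kaldi implementation: the workers are simulated one after another in a single process, the model is a toy 1-D linear regressor rather than a DNN, the "every minute or two" interval is replaced by a fixed number of SGD steps per round, and the local update is plain SGD rather than NG-SGD. All function names (`sgd_steps`, `average_params`, `train_with_averaging`, `make_shards`) are hypothetical.

```python
import random


def sgd_steps(w, shard, lr, steps, rng):
    """Run plain SGD (not NG-SGD) on y = w[0]*x + w[1] using one worker's shard."""
    w = list(w)  # copy, so each simulated worker trains its own replica
    for _ in range(steps):
        x, y = rng.choice(shard)
        err = (w[0] * x + w[1]) - y  # gradient of 0.5*err^2 w.r.t. the prediction
        w[0] -= lr * err * x
        w[1] -= lr * err
    return w


def average_params(replicas):
    """Element-wise mean of the workers' parameter vectors."""
    n = len(replicas)
    return [sum(vals) / n for vals in zip(*replicas)]


def train_with_averaging(shards, rounds=50, steps_per_round=200, lr=0.05, seed=0):
    """Alternate local training on each shard with averaging and redistribution."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(rounds):
        # Every worker starts the round from the same averaged parameters
        # but sees only its own data.
        replicas = [sgd_steps(w, shard, lr, steps_per_round, rng) for shard in shards]
        # Average, then redistribute by using the mean as the next starting point.
        w = average_params(replicas)
    return w


def make_shards(num_workers=4, per_worker=100, seed=1):
    """Synthetic, noise-free data from y = 2x - 1, split into disjoint shards."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(per_worker):
            x = rng.uniform(-1.0, 1.0)
            shard.append((x, 2.0 * x - 1.0))
        shards.append(shard)
    return shards


if __name__ == "__main__":
    slope, intercept = train_with_averaging(make_shards())
    print(slope, intercept)
```

On this convex toy problem plain averaging recovers the generating slope and intercept; the abstract's observation that averaging alone "does not work very well" concerns DNN training, which this sketch does not attempt to reproduce.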

Code & Models

Repositories

YiwenShaoStephen/NGD-SGD
pytorch

Taxonomy

Topics: Neural Networks and Applications · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Stochastic Gradient Descent