Asynchronous Decentralized Distributed Training of Acoustic Models
Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung

TL;DR
This paper explores asynchronous decentralized distributed training methods for acoustic models, demonstrating their advantages over traditional synchronous approaches, especially with large batch sizes, and achieving fast training times with high accuracy.
Contribution
The paper introduces and analyzes three variants of asynchronous decentralized parallel SGD, providing theoretical convergence rates and empirical evaluations for acoustic model training.
Findings
ADPSGD variants outperform synchronous training with large batches.
Delay-by-one scheme achieves fastest convergence among variants.
Models trained with ADPSGD reach high accuracy in less than 2 hours using 128 GPUs.
Abstract
Large-scale distributed training of deep acoustic models plays an important role in today's high-performance automatic speech recognition (ASR). In this paper we investigate a variety of asynchronous decentralized distributed training strategies based on data parallel stochastic gradient descent (SGD) to show their superior performance over the commonly-used synchronous distributed training via allreduce, especially when dealing with large batch sizes. Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme. We introduce a mathematical model of ADPSGD, give its theoretical convergence rate, and compare the empirical convergence behavior and straggler resilience properties of the three variants. Experiments are carried out on an IBM supercomputer for training…
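To make the ring-based idea in the abstract concrete, the following is a minimal toy sketch of decentralized parallel SGD with neighbor averaging on a ring. It is not the paper's implementation: it runs in lockstep rounds (so it does not model asynchrony, stragglers, or the delay-by-one scheme), it uses a scalar quadratic loss per worker instead of an acoustic model, and the two mixing rules (`mix_fixed_ring`, `mix_random_ring`) are assumed illustrations of what "fixed" and "randomized" communication patterns could look like. All function names and parameters are hypothetical.

```python
import random


def local_gradient(w, target):
    # Gradient of the toy per-worker loss 0.5 * (w - target) ** 2.
    return w - target


def mix_fixed_ring(weights):
    # Assumed "fixed" pattern: every worker averages its own weight
    # with its left and right neighbors on the ring.
    n = len(weights)
    return [
        (weights[(i - 1) % n] + weights[i] + weights[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def mix_random_ring(weights, rng):
    # Assumed "randomized" pattern: each round, a random offset decides
    # whether worker i pairs with its right neighbor starting from an
    # even or an odd index; each pair then averages its two weights.
    n = len(weights)
    assert n % 2 == 0, "this toy pairing needs an even number of workers"
    offset = rng.randrange(2)
    mixed = list(weights)
    for i in range(offset, n, 2):
        j = (i + 1) % n
        avg = (weights[i] + weights[j]) / 2.0
        mixed[i] = avg
        mixed[j] = avg
    return mixed


def train(targets, mode="fixed", steps=300, lr=0.1, seed=0):
    # One weight per worker; each worker only sees its own target.
    rng = random.Random(seed)
    weights = [0.0] * len(targets)
    for _ in range(steps):
        # Gradients are taken at the pre-mixing weights.
        grads = [local_gradient(w, t) for w, t in zip(weights, targets)]
        if mode == "fixed":
            mixed = mix_fixed_ring(weights)
        else:
            mixed = mix_random_ring(weights, rng)
        weights = [m - lr * g for m, g in zip(mixed, grads)]
    return weights


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0, 4.0]
    for mode in ("fixed", "random"):
        final = train(targets, mode=mode)
        print(mode, sum(final) / len(final))
```

Both mixing rules preserve the mean of the workers' weights, so in this toy problem the average across workers approaches the mean of the targets without any global allreduce. Individual workers still differ slightly from that average when the step size is constant.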
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Stochastic Gradient Descent
