Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling
Wenpeng Li, BinBin Zhang, Lei Xie, Dong Yu

TL;DR
This paper systematically compares four parallel training algorithms for deep learning-based speech recognition models, providing practical guidance on their efficiency, stability, and scalability for large datasets and models.
Contribution
It offers a comprehensive empirical evaluation of ASGD, BMUF, BSP, and EASGD on speech recognition tasks, highlighting BMUF as the most effective method.
Findings
BMUF is the most stable and scalable algorithm.
BMUF often outperforms single-GPU SGD.
ASGD can be a viable alternative in some scenarios.
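For readers unfamiliar with BMUF, the block-level update can be sketched as follows. This is a minimal illustration of the commonly described classical block-momentum formulation, not code from the paper; the function name `bmuf_step` and the parameter names `block_momentum` and `block_lr` are hypothetical, and models are represented as flat lists of floats for simplicity.

```python
def bmuf_step(global_w, worker_ws, delta_prev, block_momentum=0.9, block_lr=1.0):
    """One BMUF-style synchronization step (illustrative sketch).

    global_w   -- global model at the start of the block (list of floats)
    worker_ws  -- models returned by each worker after training on its data split
    delta_prev -- filtered update applied at the previous block
    """
    n_workers = len(worker_ws)
    # Average the worker models parameter by parameter.
    avg = [sum(w[i] for w in worker_ws) / n_workers for i in range(len(global_w))]
    # Block-level "gradient": how far the average moved from the old global model.
    g = [a - w for a, w in zip(avg, global_w)]
    # Filter the update with block momentum, then apply it to the global model.
    delta = [block_momentum * d + block_lr * gi for d, gi in zip(delta_prev, g)]
    new_w = [w + d for w, d in zip(global_w, delta)]
    return new_w, delta


# Two workers, two parameters, no previous update.
w, d = bmuf_step([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0],
                 block_momentum=0.5, block_lr=1.0)
```

With `block_momentum=0.0` and `block_lr=1.0`, this sketch reduces to plain model averaging across workers.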
Abstract
Deep learning models (DLMs) are state-of-the-art techniques in speech recognition. However, training good DLMs can be time-consuming, especially for production-size models and corpora. Although several parallel training algorithms have been proposed to improve training efficiency, there is no clear guidance on which one to choose for the task at hand, owing to the lack of a systematic and fair comparison among them. In this paper we aim to fill this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering (BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on the 1000-hour LibriSpeech corpus using feed-forward deep neural networks (DNNs) and convolutional, long short-term memory, deep neural networks (CLDNNs). Based on our experiments, we recommend using…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Stochastic Gradient Descent
