A Random Gossip BMUF Process for Neural Language Modeling
Yiheng Huang, Jinchuan Tian, Lei Han, Guangsen Wang, Xingcheng Song, Dan Su, Dong Yu

TL;DR
This paper introduces a decentralized gossip-based BMUF method for neural language modeling that reduces communication overhead and improves scalability, achieving better perplexity scores on large datasets with multiple GPUs.
Contribution
It proposes a novel decentralized BMUF process using random neighbor communication, enhancing scalability and performance in distributed neural language model training.
Findings
Achieves lower perplexity than the single-GPU baseline on WikiText-103.
Maintains performance without degradation when scaling to 8 and 16 GPUs.
Outperforms conventional BMUF in experimental evaluations.
Abstract
Neural network language models (NNLMs) are an essential component of industrial ASR systems. One important challenge in training an NNLM is balancing the scaling of the learning process against the handling of big data. Conventional approaches such as block momentum provide a blockwise model update filtering (BMUF) process and achieve almost linear speedups with no performance degradation for speech recognition. However, BMUF needs to calculate the model average over all computing nodes (e.g., GPUs), and when the number of computing nodes is large, learning suffers from severe communication latency. As a consequence, BMUF is not suitable under restricted network conditions. In this paper, we present a decentralized BMUF process, in which the model is split into different components, each of which is updated by communicating with some randomly chosen neighbor nodes holding the same component,…
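The idea in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract is truncated before it gives the exact update rule, so the function name `gossip_bmuf_step`, the parameters `n_neighbors`, `block_momentum`, and `block_lr`, and the choice to apply the standard block-momentum filter to a partial average over random neighbors are all hypothetical.

```python
import numpy as np


def gossip_bmuf_step(local, global_prev, delta_prev, rng,
                     n_neighbors=2, block_momentum=0.9, block_lr=1.0):
    """One hypothetical gossip-style BMUF synchronization step.

    local:       (n_nodes, n_components, dim) models after local training
    global_prev: same shape, each node's model after the previous sync
    delta_prev:  same shape, each node's previous filtered update
    """
    n_nodes, n_components, _ = local.shape
    global_new = np.empty_like(local)
    delta_new = np.empty_like(local)
    for node in range(n_nodes):
        others = [i for i in range(n_nodes) if i != node]
        for comp in range(n_components):
            # Assumption: every component draws its own random neighbors,
            # so no node ever waits for an average over all nodes.
            peers = rng.choice(others, size=n_neighbors, replace=False)
            group = np.append(peers, node)
            avg = local[group, comp].mean(axis=0)
            # Block-momentum filter, applied here to the partial average
            # instead of the all-node average used by conventional BMUF.
            g = avg - global_prev[node, comp]
            delta_new[node, comp] = (block_momentum * delta_prev[node, comp]
                                     + block_lr * g)
            global_new[node, comp] = global_prev[node, comp] + delta_new[node, comp]
    return global_new, delta_new
```

In this sketch, setting `n_neighbors` to the number of nodes minus one makes every component average over all nodes, which recovers a conventional all-node BMUF step; smaller values reduce how many peers each node must contact per synchronization.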
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
