Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks
Yun Dai, Tejas Dharamsi, Byron Hsu, Tao Song, Hamed Firooz

TL;DR
This paper identifies convergence issues in large language model training on low-bandwidth networks caused by race conditions in hierarchical partitioning, and proposes an improved algorithm that ensures reliable training of billion-parameter models with high efficiency.
Contribution
The paper introduces a modified partitioning algorithm that fixes convergence problems in ZeRO++ hpZ, enabling stable training of large models on low-bandwidth clusters.
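To make concrete why hpZ matters on low-bandwidth clusters, the sketch below tallies cross-node traffic per training step. It assumes the accounting commonly given for ZeRO-3 and ZeRO++: a forward all-gather, a backward all-gather, and a gradient reduce-scatter, each moving roughly the model size across nodes, with hpZ serving the backward all-gather from a node-local secondary copy. The function name and parameters are hypothetical, and the figures are a back-of-the-envelope estimate, not measurements from this paper.

```python
def cross_node_volume(model_size: float, use_hpz: bool) -> float:
    """Approximate cross-node traffic per training step, in parameters.

    Assumed accounting (not taken from this paper): each collective
    moves about `model_size` parameters across nodes.
    """
    forward_all_gather = model_size
    # With hpZ the backward all-gather reads a node-local secondary copy,
    # so it contributes no cross-node traffic in this simplified model.
    backward_all_gather = 0.0 if use_hpz else model_size
    gradient_reduce_scatter = model_size
    return forward_all_gather + backward_all_gather + gradient_reduce_scatter


if __name__ == "__main__":
    params = 7e9  # hypothetical 7-billion-parameter model
    print(cross_node_volume(params, use_hpz=False))
    print(cross_node_volume(params, use_hpz=True))
```

Under these assumptions hpZ alone cuts cross-node traffic by a third, which is why its correctness matters so much when inter-node links are slow.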
Findings
Achieves reliable convergence on Falcon and Llama-2 models.
Maintains 98% throughput and training speed.
Ensures high-quality model convergence.
Abstract
Training extremely large language models (LLMs) with billions of parameters is a computationally intensive task that pushes the limits of current data parallel training systems. While techniques like ZeRO++ have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to…
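The abstract does not spell out the mechanics of the race. As a toy illustration only (not the paper's algorithm and not DeepSpeed code), the sketch below models one generic way such a hazard can arise: a node-local secondary buffer is filled by an asynchronous copy, and a consumer reads it without first waiting for the copy to finish. All names are hypothetical, and the in-flight copy is delayed with an event so the outcome is deterministic.

```python
import threading


def gather_from_secondary(synchronize: bool) -> list:
    """Read a node-local secondary buffer that is filled asynchronously.

    Toy model only: `primary` stands in for up-to-date weights and
    `secondary` for a node-local copy of them.
    """
    primary = [1.0, 2.0, 3.0, 4.0]
    secondary = [0.0] * len(primary)   # stale until the copy lands
    release = threading.Event()        # holds the copy "in flight"
    done = threading.Event()           # set once the copy has finished

    def async_copy() -> None:
        release.wait()                 # the copy has been issued but not completed
        secondary[:] = primary
        done.set()

    worker = threading.Thread(target=async_copy)
    worker.start()

    if synchronize:
        release.set()
        done.wait()                    # wait for the copy before reading
        result = list(secondary)
    else:
        result = list(secondary)       # hazard: read while the copy is pending
        release.set()

    worker.join()
    return result


if __name__ == "__main__":
    print(gather_from_secondary(synchronize=False))  # stale values
    print(gather_from_secondary(synchronize=True))   # up-to-date values
```

In this model the unsynchronized reader silently computes with stale weights instead of failing, which is the kind of error that would surface as poor convergence, not as a crash.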
Taxonomy
Topics: Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
