Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks

Yun Dai; Tejas Dharamsi; Byron Hsu; Tao Song; Hamed Firooz

arXiv:2407.01614 · cs.LG · October 8, 2024

TL;DR

This paper identifies convergence issues in large language model training on low-bandwidth networks caused by race conditions in hierarchical partitioning, and proposes an improved algorithm that ensures reliable training of billion-parameter models with high efficiency.

Contribution

The paper introduces a modified partitioning algorithm that fixes convergence problems in ZeRO++ hpZ, enabling stable training of large models on low-bandwidth clusters.

Findings

01 Achieves reliable convergence on Falcon and Llama-2 models.

02 Maintains 98% throughput and training speed.

03 Ensures high-quality model convergence.

Abstract

Training extremely large language models (LLMs) with billions of parameters is a computationally intensive task that pushes the limits of current data-parallel training systems. While techniques like ZeRO++ have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to…
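
As a rough illustration of the hierarchical partitioning idea named in the abstract, the following Python sketch computes which slice of a flat parameter buffer each GPU would hold under two layouts: a primary partition spread across every GPU in the cluster, and a secondary partition spread only across the GPUs of one machine, so that a later gather can stay inside the machine. This is a simplified sketch under stated assumptions, not the paper's algorithm or the DeepSpeed implementation: the function names `shard_range` and `hpz_layout` are hypothetical, shards are assumed to be contiguous and near-equal, and the sketch models neither the race condition nor the proposed fix.

```python
def shard_range(num_params, group_size, index):
    """Return the half-open [start, end) slice owned by shard `index`
    when `num_params` elements are split into `group_size` near-equal shards."""
    base, extra = divmod(num_params, group_size)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


def hpz_layout(num_params, world_size, gpus_per_node):
    """Hypothetical helper: map each global rank to its primary shard
    (split across all GPUs) and secondary shard (split within its node)."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    layout = {}
    for rank in range(world_size):
        local_rank = rank % gpus_per_node
        layout[rank] = {
            "node": rank // gpus_per_node,
            # Primary partition: one shard per GPU in the whole cluster.
            "primary": shard_range(num_params, world_size, rank),
            # Secondary partition: one shard per GPU inside a single node,
            # so each node collectively holds a full copy of the parameters.
            "secondary": shard_range(num_params, gpus_per_node, local_rank),
        }
    return layout


if __name__ == "__main__":
    for rank, info in hpz_layout(num_params=16, world_size=4, gpus_per_node=2).items():
        print(rank, info)
```

In this toy layout every node holds a complete copy of the parameters in its secondary shards, which is the property that lets a gather avoid cross-machine traffic, at the cost of extra memory per GPU.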

Taxonomy

Topics: Neural Networks and Applications
