Exploring Scaling Laws for Local SGD in Large Language Model Training

Qiaozhi He; Xiaomin Zhuang; Zhihua Wu

arXiv:2409.13198 · cs.CL · September 23, 2024

TL;DR

This paper studies how local SGD scales for training large language models, demonstrating its effectiveness and exploring its use in multi-cluster and edge computing environments through extensive experiments.

Contribution

The paper provides new insights into the scaling laws of local SGD for LLM training and evaluates its practical application in multi-cluster and edge computing scenarios.

Findings

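01. Local SGD achieves results competitive with conventional training methods, given equivalent model parameters, datasets, and computational resources.

02. The conditions necessary for effective multi-cluster LLM training are identified.

03. Edge computing resources show potential for LLM training, subject to limitations that the paper examines.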

Abstract

This paper investigates scaling laws in LLM training for local SGD, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. These results demonstrate the viability of local SGD as an alternative to single large-cluster training.
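For readers unfamiliar with the algorithm, the following is a minimal sketch of local SGD as it is commonly described: each worker takes several SGD steps on its own data without communicating, and the workers then average their parameters. It is a toy least-squares example in pure Python, not the paper's LLM training setup. The function names (`local_sgd`, `make_shards`) and all hyperparameters (`rounds`, `local_steps`, `lr`) are hypothetical and chosen only for illustration.

```python
import random


def make_shards(true_w, workers=4, per_worker=64, seed=1):
    """Build one synthetic, noise-free regression dataset per worker."""
    rng = random.Random(seed)
    shards = []
    for _ in range(workers):
        shard = []
        for _ in range(per_worker):
            x = [rng.uniform(-1.0, 1.0) for _ in true_w]
            y = sum(wi * xi for wi, xi in zip(true_w, x))
            shard.append((x, y))
        shards.append(shard)
    return shards


def local_sgd(shards, dim, rounds=100, local_steps=8, lr=0.1, seed=0):
    """Toy local SGD: H local steps per worker, then parameter averaging."""
    rng = random.Random(seed)
    global_w = [0.0] * dim
    for _ in range(rounds):
        local_models = []
        for shard in shards:
            w = list(global_w)  # every worker starts from the shared model
            for _ in range(local_steps):
                x, y = rng.choice(shard)  # one-sample stochastic gradient
                err = sum(wi * xi for wi, xi in zip(w, x)) - y
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            local_models.append(w)
        # The only communication step: average parameters across workers.
        global_w = [sum(ws) / len(local_models) for ws in zip(*local_models)]
    return global_w


true_w = [2.0, -1.0, 0.5]
shards = make_shards(true_w)
w = local_sgd(shards, dim=len(true_w))
print([round(v, 3) for v in w])
```

In this sketch, workers synchronize once every `local_steps` updates rather than after every step, which is why the approach suits loosely connected devices: raising `local_steps` reduces how often parameters must be exchanged, at the cost of letting the local models drift further apart between averaging rounds.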

Code & Models

Models

chuxin-llm/Scaling-Laws-for-Local-SGD-in-LLM-Intermediate-Checkpoints (Hugging Face model)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Stochastic Gradient Descent · Local SGD