Exploring Scaling Laws for Local SGD in Large Language Model Training
Qiaozhi He, Xiaomin Zhuang, Zhihua Wu

TL;DR
This paper studies how local SGD scales for training large language models, demonstrating its effectiveness and exploring its use in multi-cluster and edge computing environments through extensive experiments.
Contribution
It provides new insights into the scaling laws of local SGD for LLM training and evaluates its practical application in distributed and edge computing scenarios.
Findings
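Local SGD achieves results competitive with conventional training methods, given equivalent model parameters, datasets, and computational resources.
The conditions necessary for effective multi-cluster LLM training are identified.
Edge computing resources show potential for LLM training, subject to the limitations the paper examines.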
Abstract
This paper investigates scaling laws in LLM training for local SGD, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. Together, these results demonstrate the viability of local SGD as an alternative to single large-cluster training.
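For readers unfamiliar with the algorithm, the following is a minimal sketch of the general local SGD pattern: each worker takes several gradient steps on its own data shard, and the workers average their parameters only once per round. It is a toy scalar regression, not the paper's experimental setup; the function name `local_sgd` and all of its parameters (`num_workers`, `local_steps`, `rounds`, `lr`) are hypothetical choices made for illustration.

```python
import random


def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.05, seed=0):
    """Toy local SGD: fit y = w * x for a scalar w (the true value is 3.0)."""
    rng = random.Random(seed)
    true_w = 3.0

    # Assumption: each worker holds its own shard of (x, y) pairs.
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
        shards.append([(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs])

    global_w = 0.0
    for _ in range(rounds):
        local_ws = []
        for shard in shards:
            w = global_w  # every worker starts from the last synchronized model
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
                w -= lr * grad
            local_ws.append(w)
        # Synchronize: one parameter average per round instead of one per step.
        global_w = sum(local_ws) / num_workers
    return global_w


if __name__ == "__main__":
    print(local_sgd())
```

Setting `local_steps=1` makes the workers average their parameters after every single step, so larger values trade synchronization frequency for less communication, which is the property that makes the method attractive on loosely connected devices.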
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Stochastic Gradient Descent · Local SGD
