NoLoCo: No-all-reduce Low Communication Training Method for Large Models
Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies

TL;DR
NoLoCo is a training method for large models that avoids explicitly synchronizing model parameters across replicas, which reduces communication overhead and improves convergence speed across a range of model sizes and cluster scales.
Contribution
The paper proposes an optimizer that synchronizes model weights implicitly, without any collective communication, so that training remains efficient on low-bandwidth networks.
Findings
Requires significantly less communication overhead than existing methods.
Converges up to 4% faster.
Effective across a wide range of model sizes and accelerator counts.
Abstract
Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly…
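The distinction the abstract draws between a synchronization step over all model replicas and one that needs no collective communication can be made concrete with a toy sketch. The code below is an illustration only and is not NoLoCo's update rule, which the truncated abstract above does not spell out. It assumes that each replica is a plain list of floats and uses random pairwise averaging as one generic example of a non-collective exchange. The function names `all_reduce_average`, `pairwise_average`, and `spread` are hypothetical.

```python
import random


def all_reduce_average(replicas):
    """Collective sync: every replica is replaced by the global mean.

    All replicas must take part in the same operation.
    """
    n = len(replicas)
    dim = len(replicas[0])
    mean = [sum(r[i] for r in replicas) / n for i in range(dim)]
    return [list(mean) for _ in replicas]


def pairwise_average(replicas, rng):
    """Non-collective sync (illustrative assumption, not the paper's method).

    Replicas are shuffled into random pairs and each pair averages its
    weights. Each replica talks to one peer only. With an odd count, the
    leftover replica is left unchanged for this round.
    """
    order = list(range(len(replicas)))
    rng.shuffle(order)
    out = [list(r) for r in replicas]
    for a, b in zip(order[0::2], order[1::2]):
        avg = [(x + y) / 2 for x, y in zip(replicas[a], replicas[b])]
        out[a], out[b] = list(avg), list(avg)
    return out


def spread(replicas):
    """Largest per-coordinate gap between any two replicas."""
    dim = len(replicas[0])
    return max(
        max(r[i] for r in replicas) - min(r[i] for r in replicas)
        for i in range(dim)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    replicas = [[float(i), float(-i)] for i in range(8)]

    print("initial spread:", spread(replicas))
    print("after all-reduce:", spread(all_reduce_average(replicas)))

    gossip = replicas
    for step in range(1, 6):
        gossip = pairwise_average(gossip, rng)
        print(f"after {step} pairwise round(s):", spread(gossip))
```

In this toy setting the all-reduce brings every replica to the same weights in one step but involves all of them at once. The pairwise exchange only narrows the disagreement between replicas round by round while preserving the global mean, and each round needs only point-to-point messages.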
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
