OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

Sami Jaghouar; Jack Min Ong; Johannes Hagemann

arXiv:2407.07852·cs.LG·July 11, 2024

TL;DR

OpenDiLoCo is an open-source implementation of the Distributed Low-Communication (DiLoCo) training method for large language models. It sustains 90-95% compute utilization when training across geographically distributed workers and scales to billion-parameter models.

Contribution

It provides a reproducible, scalable implementation of the DiLoCo training method, built on the Hivemind library, and demonstrates low-communication training across two continents and three countries.

Findings

01

Achieved 90-95% compute utilization during training across continents.

02

Gradient all-reduction with FP16 does not degrade performance.

03

Scaled the framework to models three times the size of those in the original DiLoCo work, reaching billion-parameter models.

Abstract

OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework using the Hivemind library. We demonstrate its effectiveness by training a model across two continents and three countries while maintaining 90-95% compute utilization. Additionally, we conduct ablation studies focusing on the algorithm's compute efficiency and its scalability in the number of workers, and we show that its gradients can be all-reduced using FP16 without any performance degradation. Furthermore, we scale OpenDiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion-parameter models.
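To make the FP16 all-reduce claim concrete, here is a minimal, single-process sketch of one low-communication synchronization step. It is not the OpenDiLoCo or Hivemind code. It assumes that each worker trains locally, that the quantity exchanged is the difference between the shared parameters and each worker's local parameters, and that the average is applied with a plain SGD step standing in for whatever outer optimizer is actually used. The function names `to_fp16`, `average_pseudo_gradients`, and `outer_step` are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def average_pseudo_gradients(global_params, worker_params, use_fp16=True):
    """Average the per-worker deltas (global - local), as an all-reduce would.

    With use_fp16=True each delta is cast to half precision before averaging,
    mimicking a reduced-precision exchange (an assumption of this sketch).
    """
    n = len(worker_params)
    averaged = []
    for i, g in enumerate(global_params):
        deltas = [g - local[i] for local in worker_params]
        if use_fp16:
            deltas = [to_fp16(d) for d in deltas]
        averaged.append(sum(deltas) / n)
    return averaged


def outer_step(global_params, pseudo_grad, lr=1.0):
    """Apply the averaged delta with plain SGD (a simplification)."""
    return [g - lr * d for g, d in zip(global_params, pseudo_grad)]


# Toy example: two workers, two parameters.
global_params = [1.0, 2.0]
worker_params = [[0.5, 2.5], [0.0, 1.5]]
avg = average_pseudo_gradients(global_params, worker_params)
new_global = outer_step(global_params, avg, lr=1.0)
```

With `lr=1.0` and plain SGD, the outer step reduces to averaging the workers' parameters. Only the deltas cross the network, once per synchronization, so halving their precision halves the payload.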

Code & Models

Repositories

PrimeIntellect-ai/OpenDiLoCo (PyTorch, official)

Taxonomy

Topics: Business Process Modeling and Analysis · Intelligent Tutoring Systems and Adaptive Learning · Educational Technology and Assessment