OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
Sami Jaghouar, Jack Min Ong, Johannes Hagemann

TL;DR
OpenDiLoCo is an open-source framework for low-communication training of large language models across geographically distributed workers. It sustains 90-95% compute utilization and scales to billion-parameter models.
Contribution
The paper provides a reproducible, scalable implementation of the DiLoCo training method and demonstrates low-communication training across two continents and three countries.
Findings
Achieved 90-95% compute utilization during training across continents.
Gradient all-reduction with FP16 does not degrade performance.
Scaled the framework to models three times larger than those in the original DiLoCo work, reaching billion-parameter scale.
Abstract
OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework using the Hivemind library. We demonstrate its effectiveness by training a model across two continents and three countries while maintaining 90-95% compute utilization. Additionally, we conduct ablation studies focusing on the algorithm's compute efficiency and its scalability in the number of workers, and show that its gradients can be all-reduced using FP16 without any performance degradation. Furthermore, we scale OpenDiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion-parameter models.
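The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general DiLoCo recipe as it is usually described: each worker takes many local (inner) steps, the workers average their pseudo-gradients (starting parameters minus locally trained parameters), and an outer optimizer with Nesterov momentum applies the average. Several things here are assumptions made for illustration and are not taken from this page: plain SGD stands in for the inner optimizer, the all-reduce is simulated by an in-process average rather than a call into Hivemind, the FP16 cast uses Python's `struct` module, and every function name and hyperparameter value is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def inner_steps(params, grad_fn, lr, num_steps):
    """Run local steps on one worker (plain SGD, a stand-in inner optimizer)."""
    local = list(params)
    for _ in range(num_steps):
        grads = grad_fn(local)
        local = [p - lr * g for p, g in zip(local, grads)]
    return local


def diloco_outer_step(params, momentum, worker_grad_fns,
                      inner_lr=0.1, num_inner_steps=20,
                      outer_lr=0.7, beta=0.9):
    """One outer round: local training, FP16 averaging, Nesterov update.

    All hyperparameter defaults are toy values chosen for this sketch.
    """
    pseudo_grads = []
    for grad_fn in worker_grad_fns:
        # Every worker starts the round from the same shared parameters.
        local = inner_steps(params, grad_fn, inner_lr, num_inner_steps)
        # Pseudo-gradient = start minus end, cast to FP16 before it is sent.
        pseudo_grads.append([to_fp16(p - q) for p, q in zip(params, local)])

    # Simulated all-reduce: element-wise mean over workers.
    n = len(pseudo_grads)
    avg = [sum(column) / n for column in zip(*pseudo_grads)]

    # Outer optimizer: SGD with Nesterov momentum on the averaged pseudo-gradient.
    new_momentum = [beta * m + g for m, g in zip(momentum, avg)]
    new_params = [p - outer_lr * (g + beta * m)
                  for p, g, m in zip(params, avg, new_momentum)]
    return new_params, new_momentum


def make_quadratic_grad(target):
    """Gradient of the toy loss 0.5 * (x - target)**2 for each parameter."""
    return lambda params: [p - target for p in params]


# Two toy workers whose local optima differ; the shared optimum is their mean.
worker_grad_fns = [make_quadratic_grad(1.0), make_quadratic_grad(3.0)]
params, momentum = [0.0], [0.0]
for _ in range(30):
    params, momentum = diloco_outer_step(params, momentum, worker_grad_fns)
print(params)
```

In this toy setup the workers communicate once per 20 local steps instead of once per step, which is the source of the communication savings. The FP16 cast halves the payload relative to FP32 at the cost of a small rounding error in each pseudo-gradient.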
