AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models
Nikolay Kutuzov, Makar Baderko, Stepan Kulibaba, Artem Dzhalilov, Daniel Bobrov, Maxim Mashtaler, Alexander Gasnikov

TL;DR
AdLoCo combines multi-instance training, adaptive batching, and a switch mode to improve communication efficiency and convergence speed when training large language models on heterogeneous hardware.
Contribution
The paper presents a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism to improve the efficiency and convergence of distributed LLM training.
Findings
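Adaptive Batched DiLoCo substantially lowers synchronization delays by dynamically adjusting local batch sizes.
MIT increases training throughput and reduces node idle time; the combined method also improves convergence speed.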
Provides theoretical estimates for communication requirements.
Abstract
Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel and merge them to combine knowledge, increasing throughput and reducing idle time. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication, substantially lowering synchronization delays. Switch mode further stabilizes training by seamlessly introducing…
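The interplay the abstract describes between local training, merging, and adaptive batch sizes can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar quadratic objective, the variance-based rule in `next_batch_size` (a norm-test-style heuristic), the plain averaging in `merge_instances`, and the plain SGD outer step (DiLoCo itself is usually described with a momentum-based outer optimizer). Switch mode is not modeled.

```python
import random


def next_batch_size(per_sample_grads, current, max_size, theta=1.0):
    """Hypothetical rule (not the paper's): double the batch size when the
    variance of the mini-batch gradient estimate dominates its squared mean."""
    n = len(per_sample_grads)
    if n < 2:
        return current
    mean = sum(per_sample_grads) / n
    var = sum((g - mean) ** 2 for g in per_sample_grads) / (n - 1)
    if var / n > theta * mean ** 2:
        return min(2 * current, max_size)
    return current


def merge_instances(weights):
    """Merge local model instances by simple averaging (an assumption)."""
    return sum(weights) / len(weights)


def train(num_workers=4, outer_rounds=20, local_steps=8, inner_lr=0.1,
          outer_lr=1.0, batch_size=2, max_batch=64, seed=0):
    rng = random.Random(seed)
    global_w = 5.0                      # scalar "model"; optimum is near 0
    sizes = [batch_size] * num_workers  # each worker adapts its own batch size
    for _ in range(outer_rounds):
        local_ws = []
        for k in range(num_workers):
            w = global_w
            # Local phase: several inner steps with no communication.
            for _ in range(local_steps):
                batch = [rng.gauss(0.0, 1.0) for _ in range(sizes[k])]
                grads = [w - x for x in batch]  # grad of 0.5 * (w - x)^2
                w -= inner_lr * sum(grads) / len(grads)
                sizes[k] = next_batch_size(grads, sizes[k], max_batch)
            local_ws.append(w)
        # Communication phase: one synchronization per outer round.
        pseudo_grad = global_w - merge_instances(local_ws)
        global_w -= outer_lr * pseudo_grad
    return global_w, sizes


if __name__ == "__main__":
    final_w, final_sizes = train()
    print(final_w, final_sizes)
```

In this sketch the batch size grows only when the gradient estimate becomes noisy relative to its magnitude, which typically happens as the iterate approaches the optimum. Larger batches make each local step more expensive but more informative, which is one way to trade computation against the number of synchronizations.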
