AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models

Nikolay Kutuzov; Makar Baderko; Stepan Kulibaba; Artem Dzhalilov; Daniel Bobrov; Maxim Mashtaler; Alexander Gasnikov

arXiv:2508.18182 · cs.LG · August 26, 2025

TL;DR

AdLoCo combines adaptive batching with multi-instance training and a switch mode to improve communication efficiency and convergence speed when training large language models on heterogeneous hardware.

Contribution

The paper presents a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism to improve the efficiency and convergence of distributed LLM training.

Findings

01 Reduces synchronization delays significantly.

02 Improves training throughput and convergence speed.

03 Provides theoretical estimates for communication requirements.

Abstract

Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel and to merge them to combine knowledge, increasing throughput and reducing idle time. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication, substantially lowering synchronization delays. Switch mode further stabilizes training by seamlessly introducing…
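To make the idea of adjusting local batch sizes inside a DiLoCo-style loop concrete, here is a minimal toy sketch. It is not the authors' algorithm: the abstract does not state the adjustment rule, so the sketch assumes a variance-based "norm test" rule, and every name in it (`norm_test_batch_size`, `train`, `eta`, `max_bs`) is hypothetical. The outer step is plain averaging of worker updates, chosen for brevity.

```python
import math
import random


def norm_test_batch_size(per_sample_grads, current_bs, eta=0.5, max_bs=256):
    """Assumed rule (not taken from the paper): grow the batch until the
    variance of the mini-batch gradient is at most eta^2 times its squared norm."""
    n = len(per_sample_grads)
    mean = sum(per_sample_grads) / n
    var = sum((g - mean) ** 2 for g in per_sample_grads) / max(n - 1, 1)
    sq_norm = mean * mean
    # If even max_bs samples cannot satisfy the test, use the cap directly.
    if var > 0.0 and eta ** 2 * sq_norm * max_bs <= var:
        return max_bs
    needed = math.ceil(var / (eta ** 2 * sq_norm)) if var > 0.0 else 0
    return min(max_bs, max(current_bs, needed))  # never shrink, cap at max_bs


def train(num_workers=4, rounds=20, local_steps=10, inner_lr=0.1,
          init_bs=4, target=3.0, sigma=1.0, seed=0):
    """Toy loop on f(w) = E[(w - x)^2] / 2 with x ~ N(target, sigma)."""
    rng = random.Random(seed)
    w_global = 0.0
    batch_sizes = [init_bs] * num_workers  # one local batch size per worker
    history = [list(batch_sizes)]
    for _ in range(rounds):
        deltas = []
        for k in range(num_workers):
            w = w_global  # each worker starts from the shared parameters
            for _ in range(local_steps):
                grads = [w - rng.gauss(target, sigma)
                         for _ in range(batch_sizes[k])]
                w -= inner_lr * sum(grads) / len(grads)
                batch_sizes[k] = norm_test_batch_size(grads, batch_sizes[k])
            deltas.append(w_global - w)  # update sent at synchronization
        # Outer step: plain averaging of worker updates (outer learning rate 1),
        # a simplification made for this sketch.
        w_global -= sum(deltas) / num_workers
        history.append(list(batch_sizes))
    return w_global, history


if __name__ == "__main__":
    w, history = train()
    print("final parameter:", w)
    print("batch sizes per worker after the last round:", history[-1])
```

In this toy setting the gradient shrinks as the parameter approaches the optimum while the sampling noise does not, so the assumed rule raises the local batch sizes over time. Workers communicate only once per round, after `local_steps` local updates.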
