What happens when nanochat meets DiLoCo?
Alexander Acker, Soeren Becker, Sasho Nedelkoski, Dominik Scheinert, Odej Kao, Philipp Wiesner

TL;DR
This paper explores communication-efficient training of language models using the DiLoCo algorithm with nanochat, revealing trade-offs in convergence and downstream performance due to asynchronous updates.
Contribution
The paper implements DiLoCo as a lightweight wrapper over nanochat's training loop, providing a controlled environment for studying how communication-constrained training affects LLMs.
Findings
DiLoCo achieves stable convergence and competitive pretraining loss.
After mid-training and SFT, DiLoCo scores worse than the DDP baseline.
The representation drift appears irreversible: switching from DiLoCo to DDP does not undo it, and downstream tasks remain impaired.
Abstract
Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and…
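The inner-outer scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's wrapper or nanochat code: it assumes a toy scalar objective, plain SGD as the inner optimizer, and SGD with Nesterov momentum as the outer optimizer. The function name `diloco_toy` and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


def diloco_toy(num_workers=4, outer_rounds=50, inner_steps=10,
               inner_lr=0.05, outer_lr=0.7, momentum=0.9, seed=0):
    """Toy inner-outer training on f_k(w) = 0.5 * (w - target_k)^2.

    Illustrative only: the model is a single scalar and every
    hyperparameter here is an assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    # Each worker sees a different data shard, modelled as its own target.
    targets = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    optimum = sum(targets) / num_workers  # minimiser of the mean objective

    global_w = 5.0   # shared parameter
    velocity = 0.0   # outer momentum buffer
    sync_rounds = 0

    for _ in range(outer_rounds):
        local_ws = []
        for target in targets:
            w = global_w                   # start from the shared weights
            for _ in range(inner_steps):   # local steps, no communication
                grad = w - target
                w -= inner_lr * grad
            local_ws.append(w)

        # One communication round: average the workers' parameter deltas.
        pseudo_grad = global_w - sum(local_ws) / num_workers
        sync_rounds += 1

        # Outer step: SGD with Nesterov momentum on the pseudo-gradient.
        velocity = momentum * velocity + pseudo_grad
        global_w -= outer_lr * (pseudo_grad + momentum * velocity)

    return global_w, optimum, sync_rounds


if __name__ == "__main__":
    w, opt, syncs = diloco_toy()
    print(f"final w={w:.5f}, optimum={opt:.5f}, sync rounds={syncs}")
```

In this sketch each worker takes 500 local steps in total but the workers synchronize only 50 times, whereas a data-parallel loop that averages gradients after every step would synchronize 500 times. The reduction factor equals `inner_steps`, which is the knob that trades communication volume against how far workers drift apart between synchronizations.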
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
