MuLoCo: Muon is a practical inner optimizer for DiLoCo

Benjamin Thérien; Xiaolong Huang; Aaron Defazio; Irina Rish; Eugene Belilovsky

arXiv:2505.23725·cs.LG·February 26, 2026

Open Access

TL;DR

MuLoCo uses Muon as the inner optimizer for DiLoCo. This improves large language model training efficiency and performance, especially as the number of workers (K) increases, because Muon yields more directionally correct pseudogradients than AdamW.

Contribution

This work shows that using Muon rather than AdamW as the inner optimizer in DiLoCo improves training performance and scalability with the number of workers across several model sizes.

Findings

01. Relative to AdamW, Muon yields more directionally correct pseudogradients as the number of workers (K) increases.

02. MuLoCo outperforms DiLoCo and AdamW when training large language models across multiple scales.

03. MuLoCo maintains high performance with long synchronization intervals and quantization.

Abstract

DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and increased accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers (K) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data parallel (DP) training, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as the number of workers (K) increases. In our experiments pre-training language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M,…
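The pseudogradient mentioned in the abstract can be illustrated with a minimal sketch of one DiLoCo-style communication round. This is not the authors' implementation: the inner optimizer here is plain SGD standing in for AdamW or Muon, the outer optimizer is plain SGD (DiLoCo is usually described with a momentum-based outer optimizer), and the toy quadratic losses and all function names (`inner_sgd`, `pseudogradient`, `outer_step`, `cosine`, `make_grad_fn`) are hypothetical. Cosine similarity against the average-loss gradient is used only as one plausible way to read "directionally correct"; the text above does not specify the metric.

```python
import math

def inner_sgd(theta, grad_fn, steps, lr):
    """Stand-in inner optimizer (assumption: plain SGD, not AdamW or Muon)."""
    theta = list(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def pseudogradient(theta_global, worker_thetas):
    """Average over workers of (global params - locally updated params)."""
    k = len(worker_thetas)
    return [
        sum(tg - w[i] for w in worker_thetas) / k
        for i, tg in enumerate(theta_global)
    ]

def outer_step(theta_global, pseudo_grad, outer_lr):
    """Outer update that treats the pseudogradient like a gradient (plain SGD)."""
    return [t - outer_lr * d for t, d in zip(theta_global, pseudo_grad)]

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def make_grad_fn(target):
    """Gradient of the toy loss 0.5 * ||theta - target||^2 (hypothetical data shard)."""
    return lambda theta: [t - c for t, c in zip(theta, target)]

# K = 2 workers, each with its own toy objective, one inner step per round.
theta0 = [0.0, 0.0]
targets = [[1.0, 0.0], [0.0, 1.0]]
workers = [inner_sgd(theta0, make_grad_fn(c), steps=1, lr=0.5) for c in targets]

delta = pseudogradient(theta0, workers)          # [-0.25, -0.25]
theta1 = outer_step(theta0, delta, outer_lr=1.0)  # [0.25, 0.25]

# Reference direction: gradient of the average loss at theta0.
reference = [-0.5, -0.5]
alignment = cosine(delta, reference)
```

In this toy setting the pseudogradient points exactly along the average-loss gradient, so `alignment` is approximately 1. With more inner steps, more workers, or a different inner optimizer, the two directions can diverge, which is the kind of effect the abstract attributes to the choice of inner optimizer.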

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Big Data and Digital Economy