Communication Efficient LLM Pre-training with SparseLoCo
Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky

TL;DR
SparseLoCo is a novel training algorithm for large language models that combines error feedback, Top-k sparsification, and 2-bit quantization to drastically reduce communication costs while maintaining or improving performance.
Contribution
The paper introduces SparseLoCo, which it presents as the first method to effectively combine sparsification and quantization with error feedback for efficient LLM pre-training.
Findings
Achieves 1-3% sparsity with better performance than full-precision methods.
Reduces communication by up to 97% during training.
Improves model performance through sparse aggregation.
Abstract
Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across datacenters and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization is often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages error…
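As a rough illustration of the three ingredients named in the TL;DR (error feedback, Top-k sparsification, 2-bit quantization), the sketch below compresses one worker's pseudo-gradient in plain Python. It is not the authors' implementation. The function names, the uniform four-level quantizer, the per-message `scale`, and the `decay` parameter are all assumptions made for illustration. The abstract does not say how the paper chunks tensors, encodes indices, or aggregates across workers, so none of that is modeled here.

```python
from typing import List, Tuple


def top_k_indices(values: List[float], k: int) -> List[int]:
    """Return the indices of the k largest-magnitude entries, in ascending order."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return sorted(order[:k])


def quantize_2bit(selected: List[float]) -> Tuple[List[int], float]:
    """Map each value to one of 4 uniform levels in [-scale, scale].

    Hypothetical scheme: codes 0..3 stand for -scale, -scale/3, +scale/3, +scale.
    """
    scale = max((abs(v) for v in selected), default=0.0)
    if scale == 0.0:
        return [0] * len(selected), 0.0
    step = 2.0 * scale / 3.0
    codes = [min(3, max(0, round((v + scale) / step))) for v in selected]
    return codes, scale


def dequantize_2bit(codes: List[int], scale: float) -> List[float]:
    """Invert quantize_2bit up to quantization error."""
    step = 2.0 * scale / 3.0
    return [-scale + c * step for c in codes]


def compress_with_error_feedback(
    pseudo_grad: List[float],
    error: List[float],
    k: int,
    decay: float = 1.0,  # assumed knob; 1.0 gives classic error feedback
) -> Tuple[List[int], List[int], float, List[float]]:
    """Compress one pseudo-gradient and return (indices, codes, scale, new_error)."""
    # 1. Add the new pseudo-gradient to whatever was left unsent earlier.
    acc = [decay * e + g for e, g in zip(error, pseudo_grad)]
    # 2. Keep only the k largest-magnitude entries.
    idx = top_k_indices(acc, k)
    # 3. Quantize the kept entries to 2 bits each.
    codes, scale = quantize_2bit([acc[i] for i in idx])
    sent = dequantize_2bit(codes, scale)
    # 4. The residual (unsent entries plus quantization error) is carried forward.
    new_error = list(acc)
    for i, v in zip(idx, sent):
        new_error[i] -= v
    return idx, codes, scale, new_error


if __name__ == "__main__":
    error = [0.0, 0.0, 0.0, 0.0]
    for grad in ([0.1, -4.0, 0.2, 3.0], [0.0, 0.0, 0.0, 0.0]):
        idx, codes, scale, error = compress_with_error_feedback(grad, error, k=2)
        print(idx, codes, scale)
```

Only `idx`, `codes`, and `scale` would need to cross the network in this sketch. Entries that miss the Top-k cut are not discarded: they stay in `error` and can be selected in a later round, which is the role error feedback plays in the method as described above.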
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper presents a novel and highly practical solution (SparseLoCo) that effectively unifies two distinct lines of communication-efficient training: infrequent communication (like DiLoCo) and aggressive gradient compression (sparsification + quantization). This is a significant contribution for training in highly bandwidth-constrained settings, such as cross-datacenter or internet-based collaboration.
2. The core insight that DiLoCo's global outer momentum can be successfully replaced by a
1. The paper would benefit from a brief theoretical analysis clarifying the role of the local inner updates ($H$ steps) in the optimization process. The authors compare SparseLoCo to DeMo, which (despite also updating non-dominant information locally) provides a theoretical justification that convergence can be achieved even with minimal global information, as long as it represents the dominant components of the momentum. SparseLoCo explicitly designs a local inner loop before processing and com
The paper is well-written and the method is clearly presented. The experiments convincingly validate that the proposed approach effectively reduces communication overhead.
I have major concerns regarding the novelty and experimental validation of this work, which I find to be incremental.
1. Limited Novelty: The core technique of error-feedback is a well-established standard for achieving communication efficiency. The paper does not convincingly demonstrate a significant algorithmic advancement beyond this.
2. Insufficient Evidence for Acceleration: The experiments fail to prove that the method accelerates standard LLM training in practical settings (e.g., on 8
The authors achieve an exceptionally high gradient compression ratio, which reduces communication overhead, and conduct a detailed comparison with DiLoCo, demonstrating that their method maintains performance even under high sparsity.
- While the paper presents interesting ideas, its novelty compared to DeMo could be more clearly highlighted. It would be helpful if the authors more explicitly elaborated on the specific advancements beyond DeMo.
- The current writing could be further polished to improve overall readability and flow.
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Speech Recognition and Synthesis
