Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

TL;DR
This paper introduces Generalized Primal Averaging (GPA), a new optimizer that accelerates training of large language models and vision transformers by unifying and improving upon existing averaging-based methods, reducing memory use and increasing speed.
Contribution
GPA extends Nesterov's method to unify recent averaging optimizers, simplifies implementation, and provides theoretical convergence guarantees while improving empirical training speed.
Findings
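GPA consistently outperforms single-worker DiLoCo and AdamW on LLM training, with speedups of 8.71%, 10.13%, and 9.58% over AdamW in steps to reach the target validation loss for Llama-160M, 1B, and 8B, respectively.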
GPA achieves significant speedups on ImageNet ViT workloads.
GPA retains convergence guarantees in the convex setting when the base optimizer has $O(\sqrt{T})$ regret.
Abstract
We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers like single-worker DiLoCo and Schedule-Free, within a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure to periodically aggregate pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants to enable smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with exponential moving averaging. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW with reduced memory overhead. GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline in terms of steps to reach target validation loss for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT…
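The abstract describes GPA as a Schedule-Free-like scheme in which uniform averaging is replaced by an exponential moving average and the two interpolation constants are decoupled. The sketch below illustrates that idea under stated assumptions: the three-sequence form (`z`, `y`, `x`), the roles assigned to `mu_x` and `mu_y`, and the use of plain SGD as the base optimizer are all guesses made for illustration, since this summary does not give the paper's exact update rule. The function name `gpa_sgd` is hypothetical.

```python
import numpy as np

def gpa_sgd(grad, z0, lr=0.1, mu_x=0.9, mu_y=0.9, steps=300):
    """Hypothetical GPA-style loop (assumed form, not the paper's exact algorithm).

    z: iterate updated by the base optimizer (plain SGD here, by assumption)
    x: smoothed iterate, an exponential moving average of z
    y: interpolation of x and z at which the gradient is evaluated
    """
    z = np.asarray(z0, dtype=float).copy()
    x = z.copy()
    for _ in range(steps):
        y = mu_y * x + (1.0 - mu_y) * z   # gradient evaluation point
        z = z - lr * grad(y)              # base-optimizer step
        x = mu_x * x + (1.0 - mu_x) * z   # EMA replaces uniform averaging
    return x

def quad_grad(w):
    # Gradient of the toy objective 0.5 * ||w||^2, used only as a smoke test.
    return w

x_final = gpa_sgd(quad_grad, z0=[1.0, -2.0])
print(np.linalg.norm(x_final))
```

In this assumed form, setting `mu_x = mu_y = 0` collapses the loop to the base optimizer alone, and every step updates the average, so no separate outer loop or periodically synchronized copy of the weights appears.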
Peer Reviews
Decision·Submitted to ICLR 2026
* The method accelerates the base optimizer AdamW by a large margin and does not require the storage of a "slow weight" as in DiLoCo.
* The method is shown to be convergent in the convex case.
* If I understand it correctly, formulation (3) is the classical Polyak momentum rather than Nesterov momentum. PyTorch's implementation of Nesterov accelerated gradient is inconsistent with the document it refers to ([1, equation 3]), where the momentum aggregates the gradient evaluated at $\theta_t + \mu v_t$ instead of $\theta_t$. These two momentum algorithms have different theoretical convergence rates in the convex case, and usually exhibit a substantial convergence gap in practice. I sug
- GPA is precisely specified with both direct and memory-efficient forms; implementation notes (extra buffer, reconstruction) are helpful.
- The smoothing view and the $H\leftrightarrow \mu_x$ heuristic connect two families (Lookahead/DiLoCo vs iterate-averaging).
- On Llama-160M, GPA improves final loss over AdamW and DiLoCo at matched effective inner steps, with reported peak step-speedup ~38%.
- Theory does not justify the central empirical claims. The bound is (i) convex-only, (ii) on the average iterate, while the method uses a schedule and returns the last iterate; (iii) does not quantify when GPA strictly improves over the base beyond informal remarks about negative Bregman terms. Also, theoretical novelty is incremental. GPA’s decoupling of $\mu_x$ and $\mu_y$ is a straightforward extension of primal averaging and Schedule-Free/EMA-style iterate averaging, but Theorem 1 does not
This paper proposes the GPA scheme, which aims to simplify the DiLoCo optimizer and remove its inner optimization loop. The authors also provide theoretical guarantees on its convergence.
I think the presentation of this paper should be improved. After GPA is introduced, its analysis and connections to other methods are buried in the text, and claims are sometimes made without revealing the logic behind them. For example, at lines 259-260, why must a learning rate scheduler be used? I therefore have to admit that I do not fully understand every detail of this paper.
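The first review distinguishes momentum that evaluates the gradient at $\theta_t$ (Polyak) from momentum that evaluates it at the look-ahead point $\theta_t + \mu v_t$ (Nesterov). A minimal sketch of that distinction follows, written in the velocity form the reviewer describes. The function names, the toy objective, and the hyperparameter values are illustrative assumptions, and the sketch is not meant to reproduce the paper's formulation (3) or PyTorch's implementation.

```python
def momentum_step(grad, theta, v, lr, mu, lookahead):
    # Nesterov evaluates the gradient at the look-ahead point theta + mu * v;
    # Polyak (heavy-ball) evaluates it at the current iterate theta.
    point = theta + mu * v if lookahead else theta
    v_new = mu * v - lr * grad(point)
    return theta + v_new, v_new

def run(lookahead, steps=2, lr=0.1, mu=0.9):
    # Toy objective f(theta) = 0.5 * theta**2, so grad(theta) = theta.
    theta, v = 1.0, 0.0
    for _ in range(steps):
        theta, v = momentum_step(lambda t: t, theta, v, lr, mu, lookahead)
    return theta

theta_polyak = run(lookahead=False)
theta_nesterov = run(lookahead=True)
print(theta_polyak, theta_nesterov)
```

With zero initial velocity the first step is identical for both variants; the iterates separate from the second step onward, once the look-ahead point differs from the current iterate.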
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Advanced Neural Network Applications
