Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

Aaron Defazio; Konstantin Mishchenko; Parameswaran Raman; Hao-Jun Michael Shi; Lin Xiao

arXiv:2512.17131·cs.LG·March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Generalized Primal Averaging (GPA), a new optimizer that accelerates training of large language models and vision transformers by unifying and improving upon existing averaging-based methods, reducing memory use and increasing speed.

Contribution

GPA extends Nesterov's method to unify recent averaging optimizers, simplifies implementation, and provides theoretical convergence guarantees while improving empirical training speed.

Findings

1. GPA outperforms single-worker DiLoCo and AdamW in training speed for LLMs.
2. GPA achieves significant speedups on ImageNet ViT workloads.
3. GPA maintains convergence guarantees for base optimizers with $O(\sqrt{T})$ regret.

Abstract

We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers like single-worker DiLoCo and Schedule-Free, within a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure to periodically aggregate pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants to enable smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with exponential moving averaging. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW with reduced memory overhead. GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline in terms of steps to reach target validation loss for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT…
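The structure the abstract describes (a gradient evaluation point and a smoothed iterate, each with its own interpolation constant) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the update form is assumed from the Schedule-Free-style structure the abstract mentions, plain SGD stands in for the base optimizer, and the names `gpa_sgd`, `lr`, `mu_x`, and `mu_y` are hypothetical.

```python
import numpy as np

def gpa_sgd(grad, z0, lr=0.1, mu_x=0.9, mu_y=0.9, steps=200):
    """Hypothetical GPA-style loop with plain SGD as the base optimizer.

    Assumed update form (not taken verbatim from the paper):
        y = mu_y * x + (1 - mu_y) * z    # gradient evaluation point
        z = z - lr * grad(y)             # base-optimizer step
        x = mu_x * x + (1 - mu_x) * z    # exponential moving average
    """
    z = np.asarray(z0, dtype=float).copy()  # base-optimizer iterate
    x = z.copy()                            # smoothed iterate that is returned
    for _ in range(steps):
        y = mu_y * x + (1.0 - mu_y) * z
        z = z - lr * grad(y)
        # Schedule-Free would use a uniform weight 1/(t+1) here instead
        # of the constant EMA weight (1 - mu_x).
        x = mu_x * x + (1.0 - mu_x) * z
    return x

if __name__ == "__main__":
    # Toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself.
    print(gpa_sgd(lambda w: w, [1.0, -2.0], steps=300))
```

In this sketch, setting `mu_x = mu_y = 0` collapses the loop to the base SGD step, which makes the two constants easy to ablate independently.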

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* The method accelerates the base optimizer AdamW by a large margin and does not require storing a "slow weight" as in DiLoCo.
* The method is shown to converge in the convex case.

Weaknesses

* If I understand it correctly, formulation (3) is classical Polyak momentum rather than Nesterov momentum. PyTorch's implementation of Nesterov accelerated gradient is inconsistent with the document it refers to ([1, equation 3]), where the momentum aggregates the gradient evaluated at $\theta_t + \mu v_t$ instead of $\theta_t$. These two momentum algorithms have different theoretical convergence rates in the convex case, and usually exhibit a substantial convergence gap in practice. I sug
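The distinction the reviewer draws can be made concrete with a small sketch. It only contrasts the two textbook update rules (gradient taken at $\theta_t$ versus at the look-ahead point $\theta_t + \mu v_t$); it makes no claim about the paper's formulation (3) or about PyTorch internals, and the function name `momentum_step` and its parameters are hypothetical.

```python
def momentum_step(theta, v, grad, lr, mu, lookahead):
    """One momentum step on a scalar or NumPy array parameter.

    lookahead=False: Polyak (heavy-ball) momentum, gradient at theta.
    lookahead=True:  Nesterov-style momentum, gradient at theta + mu * v.
    """
    point = theta + mu * v if lookahead else theta
    v_new = mu * v - lr * grad(point)   # velocity update
    return theta + v_new, v_new         # parameter update

if __name__ == "__main__":
    g = lambda w: w  # gradient of f(w) = 0.5 * w**2
    print(momentum_step(1.0, 1.0, g, lr=0.1, mu=0.9, lookahead=False))
    print(momentum_step(1.0, 1.0, g, lr=0.1, mu=0.9, lookahead=True))
```

Starting from the same state, the two variants already differ after a single step, because the look-ahead gradient sees the effect of the current velocity.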

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- GPA is precisely specified with both direct and memory-efficient forms; implementation notes (extra buffer, reconstruction) are helpful.
- The smoothing view and the $H\leftrightarrow \mu_x$ heuristic connect two families (Lookahead/DiLoCo vs iterate-averaging).
- On Llama-160M, GPA improves final loss over AdamW and DiLoCo at matched effective inner steps, with reported peak step-speedup ~38%.

Weaknesses

- Theory does not justify the central empirical claims. The bound is (i) convex-only, (ii) on the average iterate, while the method uses a schedule and returns the last iterate; (iii) does not quantify when GPA strictly improves over the base beyond informal remarks about negative Bregman terms. Also, theoretical novelty is incremental. GPA’s decoupling of $\mu_x$ and $\mu_y$ is a straightforward extension of primal averaging and Schedule-Free/EMA-style iterate averaging, but Theorem 1 does not

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The proposed GPA scheme aims to simplify the DiLoCo optimizer and remove the inner optimization loop. The authors also provide theoretical guarantees on its convergence.

Weaknesses

I think the presentation of this paper should be improved. After GPA is introduced, its analysis and connections to other methods are buried in the text, and claims are sometimes made without revealing the logic behind them. For example, in lines 259-260, why do we have to use a learning rate scheduler? Therefore, I have to admit that I do not fully understand every detail of this paper.

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Advanced Neural Network Applications