DeMo: Decoupled Momentum Optimization

Bowen Peng; Lizhang Chen; Baiyu Su; Jeffrey Quesnelle; Diederik P. Kingma; Qiang Liu

arXiv:2411.19870·cs.LG·February 10, 2026

TL;DR

DeMo introduces a novel optimizer that significantly reduces communication in distributed neural network training by decoupling momentum updates and applying sparsification, enabling efficient large-scale training with minimal accuracy loss.

Contribution

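DeMo presents a momentum optimizer that reduces communication overhead by keeping momentum updates local to each worker and transmitting only a top-k sparsified slice of the transformed momentum, while maintaining convergence and serving as a drop-in replacement for momentum-based optimizers.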

Findings

01

Reduces per-step communication by up to 85x compared to AdamW-DDP.

02

Maintains comparable loss and accuracy in large language models.

03

Enables efficient training across multi-datacenter and Ethernet setups.

Abstract

Scaling neural network training increasingly depends on synchronous data-parallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization (DeMo), a drop-in replacement for any momentum-based optimizer that significantly reduces communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-k sparsification, and (iii) reuses the momentum buffer as error feedback via momentum subtraction. This design reduces per-step communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M and 1B-parameter DeMo language models show DeMo transmits up to 85x less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training…
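The three components named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`dct_matrix`, `demo_compress`, `demo_aggregate`), the heavy-ball momentum form `beta * m + g`, the dense matrix DCT (a real implementation would presumably use a fast transform on tensor chunks), and plain averaging of worker payloads are all assumptions made for illustration. The abstract does not say how the decoded update is applied to the parameters, so the sketch stops at the decoded update.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n), so that C times C-transpose is the identity."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def demo_compress(momentum, grad, beta, k, C):
    """One local step on one worker (hypothetical helper).

    Returns (sparse, residual): the k transmitted coefficients as an
    {index: value} dict, and the momentum buffer left on the worker.
    """
    # (i) Decoupled momentum: accumulate locally, with no communication.
    m = [beta * mi + gi for mi, gi in zip(momentum, grad)]
    # (ii) Orthonormal transform, then keep the k largest-magnitude coefficients.
    coeffs = matvec(C, m)
    keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:k]
    sparse = {i: coeffs[i] for i in keep}
    # (iii) Momentum subtraction: remove what was transmitted, so the buffer
    # itself carries the untransmitted remainder forward as error feedback.
    dense = [sparse.get(i, 0.0) for i in range(len(coeffs))]
    transmitted = matvec(transpose(C), dense)
    residual = [mi - qi for mi, qi in zip(m, transmitted)]
    return sparse, residual

def demo_aggregate(sparse_list, n, C):
    """Average the sparse payloads of all workers and invert the transform."""
    dense = [0.0] * n
    for sparse in sparse_list:
        for i, c in sparse.items():
            dense[i] += c / len(sparse_list)
    return matvec(transpose(C), dense)

# Usage: two workers, 8 parameters, 2 coefficients transmitted per worker.
C = dct_matrix(8)
grad_a = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
grad_b = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
sparse_a, residual_a = demo_compress([0.0] * 8, grad_a, 0.9, 2, C)
sparse_b, residual_b = demo_compress([0.0] * 8, grad_b, 0.9, 2, C)
update = demo_aggregate([sparse_a, sparse_b], 8, C)
```

In this toy setting each worker sends 2 index-value pairs instead of 8 dense values. Because the transform is orthonormal, the transmitted part and the residual add back up to the full local momentum, which is what lets the buffer double as error feedback.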

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

- Practical Impact: Achieves dramatic communication reduction (up to 85×) while maintaining model quality, which is highly valuable for distributed training scenarios with limited bandwidth.
- Clean Design: The three-component approach (decoupled momentum, DCT+top-k, momentum subtraction) is conceptually clear and builds naturally on existing optimization principles.

Weaknesses

- Limited Scalability Analysis: The download bandwidth scales with the number of workers, which could become prohibitive at large scale; but all experiments use 64 GPUs; scaling behavior to hundreds or thousands of GPUs remains unclear.
- Multi-datacenter: The method is positioned for "multi-datacenter" training but only tested within single datacenters.
- Baseline Comparisons: No comparison with other communication-efficient methods (e.g., gradient quantization, other sparsification approaches

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The paper's core idea of decoupling momentum into dominant components (for communication) and non-dominant components (for local updates) presents an effective and intuitive paradigm for reducing communication overhead.
2. The application of DCT as the mechanism for sparsification, rather than operating in the raw gradient or momentum space, is a novel approach that provides a new perspective on structured compression.

Weaknesses

1. The theoretical analysis, particularly the convergence rate presented in Theorem 1, warrants further discussion as it does not appear to be optimal. For compressed stochastic optimization algorithms in the non-convex setting, convergence rates can achieve $\mathcal{O}(\omega/\sqrt{NT})$ (where $\omega$ is the variance bound of the compression estimator, see [1]). Notably, this rate is dimension-free ($D$ is not in the numerator). The dimension $D$ typically appears when analyzing the tota

Reviewer 03·Rating 4·Confidence 5

Strengths

The primary objective of this paper is to democratize the training and fine-tuning of large models by minimizing communication volume, which obviates the requirement for expensive, high-performance networking infrastructure.

Weaknesses

Limited Perceived Novelty: The claimed novelties of DeMo appear to build upon existing work. For instance, [1] also compresses the momentum term rather than the raw gradient. Even earlier literature has demonstrated that compressing the total model update, rather than the intermediate gradient, can maintain competitive performance while dramatically reducing communication overhead. Furthermore, the technique of using the momentum buffer to store error feedback for memory efficiency is also emplo

Code & Models

Repositories

bloc97/demo
PyTorch·Official

Taxonomy

Topics·Robotic Path Planning Algorithms

Methods·AdamW