DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Mogens Henrik From; Jacob Nielsen; Lukas Galke Poech; Peter Schneider-Kamp

arXiv:2502.06728·cs.LG·November 18, 2025

TL;DR

DeToNATION introduces FlexDeMo, a hybrid sharded data parallel training method that reduces communication overhead while maintaining model accuracy, enabling faster training of large neural networks across multiple nodes and accelerators.

Contribution

The paper proposes FlexDeMo, a novel hybrid sharded data parallel training strategy that extends Decoupled Momentum to multi-accelerator, multi-node settings, and introduces the DeToNATION framework for flexible distributed training.

Findings

01  FlexDeMo achieves validation loss similar to that of full gradient synchronization methods.

02  FlexDeMo significantly reduces training time compared to traditional methods.

03  The approach is effective across language and vision tasks.

Abstract

Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to exchange only the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients -- resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo -- introducing new variations of replication schemes…
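To make the core idea of the abstract concrete, the following is a minimal single-process sketch of one decoupled-momentum step. It is an illustration under stated assumptions, not the authors' implementation: the nodes are simulated as plain Python lists, the "fast-moving components" are approximated by the k largest-magnitude momentum entries (a stand-in for whatever extraction the paper uses), and the names `top_k_indices` and `decoupled_step` as well as the parameters `lr`, `beta`, and `k` are hypothetical.

```python
def top_k_indices(values, k):
    """Return the indices of the k largest-magnitude entries, in ascending order."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return sorted(order[:k])


def decoupled_step(params, momenta, grads, lr=0.1, beta=0.9, k=1):
    """Simulate one training step across several nodes.

    params  -- parameter vector shared by all nodes (list of floats)
    momenta -- one local momentum vector per node; updated in place and
               never transmitted in full
    grads   -- one local gradient vector per node
    Returns the new parameters and the averaged update that was exchanged.
    """
    n_nodes = len(grads)
    dim = len(params)
    summed = [0.0] * dim

    for node in range(n_nodes):
        m = momenta[node]
        # Accumulate the local gradient into the local momentum.
        for i in range(dim):
            m[i] = beta * m[i] + grads[node][i]
        # Assumption: treat the k largest-magnitude entries as "fast-moving".
        for i in top_k_indices(m, k):
            summed[i] += m[i]  # only these entries would cross the network
            m[i] = 0.0         # transmitted part is removed from local momentum

    # Averaging stands in for the inter-node synchronization.
    update = [s / n_nodes for s in summed]
    new_params = [p - lr * u for p, u in zip(params, update)]
    return new_params, update


if __name__ == "__main__":
    params = [0.0, 0.0, 0.0]
    momenta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    grads = [[1.0, 0.1, 0.0], [0.2, -2.0, 0.0]]
    new_params, update = decoupled_step(params, momenta, grads)
    print(new_params, update, momenta)
```

Each simulated node sends only k values per step instead of its full gradient, and whatever is not sent stays in the local momentum for later steps. In the FlexDeMo setting described above, each such node would itself be a group of accelerators that shard the parameters among themselves; that intra-node sharding is not modeled here.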

Taxonomy

Topics: Robotic Path Planning Algorithms

Methods: AdamW · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings