DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes
Mogens Henrik From, Jacob Nielsen, Lukas Galke Poech, Peter Schneider-Kamp

TL;DR
DeToNATION introduces FlexDeMo, a hybrid sharded data parallel training method that reduces communication overhead while maintaining model accuracy, enabling faster training of large neural networks across multiple nodes and accelerators.
Contribution
The paper proposes FlexDeMo, a novel hybrid sharded data parallel training strategy that extends Decoupled Momentum to multi-accelerator, multi-node settings, and introduces the DeToNATION framework for flexible distributed training.
Findings
FlexDeMo achieves validation loss similar to that of full gradient synchronization methods.
FlexDeMo significantly reduces training time compared to methods that synchronize full gradients.
The approach is effective across language and vision tasks.
Abstract
Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients -- resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo -- introducing new variations of replication schemes…
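The core idea in the abstract, accumulating momentum locally and exchanging only a few fast-moving components between nodes, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, "fast-moving" is approximated by plain top-k magnitude selection, nodes are simulated as Python lists in one process, and the intra-node parameter sharding that FlexDeMo adds is not modeled.

```python
# Toy sketch: each simulated node keeps its own momentum and shares only
# its k largest entries. All names are hypothetical; top-k by magnitude is
# an assumed stand-in for the method's actual component extraction.

def top_k_indices(values, k):
    """Return the indices of the k entries with the largest absolute value."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return sorted(order[:k])


def decoupled_step(params, momenta, grads, k, lr=0.1, beta=0.9):
    """Run one update for several simulated nodes holding identical params.

    params:  list[float], replicated on every node
    momenta: list[list[float]], one local momentum buffer per node (mutated)
    grads:   list[list[float]], one local gradient per node
    """
    n_nodes = len(momenta)
    shared = [0.0] * len(params)  # stands in for the inter-node exchange
    for node in range(n_nodes):
        m = momenta[node]
        # Accumulate momentum locally; the full buffer is never exchanged.
        for i, g in enumerate(grads[node]):
            m[i] = beta * m[i] + g
        # Send only the k largest components and remove them locally.
        for i in top_k_indices(m, k):
            shared[i] += m[i] / n_nodes
            m[i] = 0.0
    # Every node applies the same averaged sparse update.
    return [p - lr * u for p, u in zip(params, shared)]


params = [1.0, 1.0, 1.0, 1.0]
momenta = [[0.0] * 4, [0.0] * 4]
grads = [[4.0, 0.1, 0.0, 0.2], [2.0, 0.0, 0.3, 0.1]]
params = decoupled_step(params, momenta, grads, k=1)
```

With `k=1`, each simulated node transmits one value instead of four, and the components that were not sent remain in the local momentum buffers to be carried into later steps.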
Taxonomy
Topics: Robotic Path Planning Algorithms
Methods: AdamW · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
