ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment

Xiaofeng Wu; Jia Rao; Wei Chen

arXiv:2403.10504 · cs.DC · March 18, 2024 · 1 citation

Open Access

TL;DR

ATOM introduces a decentralized, asynchronous training framework for large language models that leverages cost-effective hardware, enabling scalable training without central bottlenecks and outperforming traditional methods in slow network environments.

Contribution

The paper presents a novel decentralized training framework, ATOM, which allows training large models on commodity hardware with asynchronous model swapping and optimized partitioning strategies.

Findings

01. Up to 20x training efficiency improvement over existing decentralized pipeline methods.

02. Successfully trains GPT-3 configurations on consumer-grade hardware.

03. Avoids central failure points inherent in pipeline parallelism.

Abstract

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory and high-speed interconnects poses challenges for training large-scale models. This makes it daunting for many users to experiment with pre-training and fine-tuning large language models (LLMs). In this study, we introduce ATOM, a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting using cost-effective hardware, including consumer-grade GPUs and Ethernet. Unlike conventional model partitioning methods that distribute sub-models across GPUs, ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various…
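The model-swapping idea mentioned in the abstract can be illustrated with a toy sketch. This is not ATOM's implementation: the class `SwappingExecutor`, its fields, and the scalar "layers" are hypothetical stand-ins, and the sketch assumes only what the abstract states, namely that a complete model lives on one host while only part of it needs to be on the accelerator at a time. Real systems would also overlap transfers with computation, which is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class SwappingExecutor:
    """Toy executor (hypothetical): all weights live in host memory, and at
    most `capacity` layers are resident on the simulated accelerator."""
    host_weights: list          # one scalar weight per layer, kept in "host RAM"
    capacity: int               # max layers resident on the device at once
    resident: dict = field(default_factory=dict)  # layer index -> weight
    loads: int = 0              # number of host-to-device layer transfers

    def _load_segment(self, start):
        # Evict the previous segment, then copy the next one onto the device.
        self.resident.clear()
        end = min(start + self.capacity, len(self.host_weights))
        for i in range(start, end):
            self.resident[i] = self.host_weights[i]
            self.loads += 1

    def forward(self, x):
        for i in range(len(self.host_weights)):
            if i not in self.resident:
                self._load_segment(i)
            x = x * self.resident[i]   # stand-in for a real layer computation
        return x


# A five-layer "model" run on a device that can hold only two layers at once.
executor = SwappingExecutor(host_weights=[2.0, 0.5, 3.0, 1.0, 2.0], capacity=2)
output = executor.forward(1.0)
```

The point of the sketch is that the full forward pass completes even though the device never holds more than `capacity` layers; the cost is one transfer per layer per pass, which is the overhead a swapping-based design has to hide or amortize.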

Taxonomy

Topics: Neural Networks and Applications