DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Avinash Maurya; Robert Underwood; M. Mustafa Rafique; Franck Cappello; Bogdan Nicolae

arXiv:2406.10707·cs.DC·June 18, 2024

TL;DR

This paper introduces DataStates-LLM, a lazy asynchronous checkpointing method that significantly reduces I/O overheads during large language model training, enabling faster and more scalable checkpointing at high frequencies.

Contribution

It proposes a novel multi-level lazy asynchronous checkpointing approach that leverages tensor immutability to minimize I/O interference during LLM training.
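
The idea of lazy, multi-level asynchronous checkpointing can be illustrated with a small sketch. This is not the DataStates-LLM implementation: it is a toy under stated assumptions, using plain Python lists in place of GPU tensors and a JSON file in place of a parallel file system. The names LazyCheckpointer, wait_for_copy, wait_for_flush, and train are hypothetical. The point it shows is the ordering: parameters do not change during the forward and backward passes, so a background thread can copy them to host memory in that window, and training only has to block right before the optimizer update mutates them. The flush to storage then continues in the background.

```python
import json
import os
import threading


class LazyCheckpointer:
    """Toy two-level checkpointer: host-memory snapshot, then background flush."""

    def __init__(self, directory):
        self.directory = directory
        self._copied = threading.Event()
        self._copied.set()  # no snapshot in flight initially
        self._worker = None

    def start(self, step, params):
        """Begin a snapshot of `params` and return immediately."""
        self.wait_for_flush()  # keep at most one checkpoint in flight
        self._copied.clear()
        self._worker = threading.Thread(target=self._run, args=(step, params))
        self._worker.start()

    def _run(self, step, params):
        # Level 1: copy into a host-side buffer while params are still immutable.
        snapshot = {name: list(values) for name, values in params.items()}
        self._copied.set()  # the trainer may now mutate params
        # Level 2: flush the private copy to persistent storage.
        path = os.path.join(self.directory, f"step_{step}.json")
        with open(path, "w") as f:
            json.dump(snapshot, f)

    def wait_for_copy(self):
        """Block until the in-memory snapshot is complete."""
        self._copied.wait()

    def wait_for_flush(self):
        """Block until the snapshot has been written to storage."""
        if self._worker is not None:
            self._worker.join()


def train(num_steps, directory):
    params = {"w": [0.0, 0.0], "b": [0.0]}
    ckpt = LazyCheckpointer(directory)
    for step in range(num_steps):
        ckpt.start(step, params)  # snapshot overlaps with the work below
        # Stand-in for forward/backward: reads params, never writes them.
        grads = {name: [1.0] * len(values) for name, values in params.items()}
        ckpt.wait_for_copy()  # lazy barrier, placed just before the update
        for name, values in params.items():
            for i in range(len(values)):
                values[i] -= 0.1 * grads[name][i]
    ckpt.wait_for_flush()
    return params
```

In this sketch the checkpoint written for step k holds the parameters as they were before update k. The copy is trivially fast here; with real models the benefit would come from hiding a much longer device-to-host transfer behind the forward and backward passes.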

Findings

1. Up to 48× faster checkpointing.

2. 2.2× reduction in total training time.

3. Effective at scales of up to 180 GPUs.

Abstract

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically impact the training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can…

Code & Models

Repositories

datastates/datastates-llm (PyTorch, official)
