TL;DR
This paper introduces DataStates-LLM, a lazy asynchronous checkpointing method that significantly reduces I/O overheads during large language model training, enabling faster and more scalable checkpointing at high frequencies.
Contribution
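The paper proposes a multi-level lazy asynchronous checkpointing approach that exploits tensor immutability: because the model and optimizer state tensors stay unchanged during the forward and backward passes, they can be copied in the background with minimal interference with LLM training.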
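The idea of copying state while it is immutable can be illustrated with a minimal, single-level sketch. It is not the DataStates-LLM implementation: the `LazyAsyncCheckpointer` class, the `train` function, the toy update rule, and the in-memory `snapshots` list (standing in for host memory or persistent storage) are all hypothetical names chosen for illustration. The sketch assumes that the state is only read during the forward and backward passes, so a background copy can overlap with them, and that training blocks only right before the update step mutates the state.

```python
import copy
import threading


class LazyAsyncCheckpointer:
    """Hypothetical helper: copies training state in a background thread."""

    def __init__(self):
        self._thread = None
        self.snapshots = []  # stands in for host memory / persistent storage

    def start(self, step, state):
        # The caller must not mutate `state` until wait() has returned.
        self.wait()

        def _copy():
            self.snapshots.append((step, copy.deepcopy(state)))

        self._thread = threading.Thread(target=_copy)
        self._thread.start()

    def wait(self):
        # Block until any pending background copy has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def train(num_steps, checkpoint_every, lr=0.5):
    state = {"weights": [0.0, 0.0], "step": 0}
    ckpt = LazyAsyncCheckpointer()
    for step in range(1, num_steps + 1):
        # Forward/backward pass: only reads the state, so a pending
        # checkpoint copy may run concurrently with this phase.
        grads = [1.0 for _ in state["weights"]]

        # The update is about to mutate the state: wait for the copy first.
        ckpt.wait()
        state["weights"] = [w - lr * g for w, g in zip(state["weights"], grads)]
        state["step"] = step

        if step % checkpoint_every == 0:
            # Start the copy and return at once; it overlaps with the next
            # iteration's forward/backward pass.
            ckpt.start(step, state)
    ckpt.wait()
    return state, ckpt.snapshots


final_state, snapshots = train(num_steps=4, checkpoint_every=2)
```

In this toy version the only synchronization point is the `wait()` call before the update, which is what keeps each snapshot consistent. A multi-level design would add further stages (for example device to host, then host to storage), which the sketch omits.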
Findings
Up to 48× faster checkpointing.
A 2.2× reduction in total training time.
Effective at scales up to 180 GPUs.
Abstract
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically impact training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can…
