Understanding LLM Checkpoint/Restore I/O Strategies and Patterns
Mikaila J. Gossman, Avinash Maurya, Bogdan Nicolae, Jon C. Calhoun

TL;DR
This paper investigates I/O strategies for checkpointing large language models, highlighting the importance of aggregation and coalescing to improve throughput and reduce bottlenecks in big-data I/O workflows.
Contribution
It provides a detailed analysis of I/O bottlenecks in LLM checkpointing and proposes microbenchmark-based strategies using liburing to optimize performance.
Findings
Uncoalesced small-buffer I/O halves throughput.
Aggregation and alignment restore bandwidth and reduce metadata overhead.
The proposed approach achieves up to 7.6x higher throughput than existing engines.
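To make the aggregation and alignment findings concrete, the following is a minimal sketch of the general idea, not the paper's implementation. It packs many small buffers into one contiguous buffer padded to a block boundary and writes it in a single large request instead of one request per buffer. It uses plain os.write rather than liburing. The names coalesce, write_coalesced, and read_back, and the 4096-byte alignment, are assumptions chosen for illustration.

```python
import os

ALIGN = 4096  # assumed block size; the right value depends on device and file system


def coalesce(buffers, align=ALIGN):
    """Pack small buffers into one contiguous, alignment-padded buffer.

    Returns (packed, index), where index[i] = (offset, length) of buffers[i].
    """
    index = []
    offset = 0
    for buf in buffers:
        index.append((offset, len(buf)))
        offset += len(buf)
    # Round the total size up to the next multiple of `align`.
    padded = -(-offset // align) * align
    packed = bytearray(padded)
    for (off, length), buf in zip(index, buffers):
        packed[off:off + length] = buf
    return packed, index


def write_coalesced(path, buffers, align=ALIGN):
    """Write all buffers as one large sequential stream; return the index."""
    packed, index = coalesce(buffers, align)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(packed)
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < len(packed):
            written += os.write(fd, view[written:])
    finally:
        os.close(fd)
    return index


def read_back(path, index):
    """Recover the original buffers from the packed file using the index."""
    with open(path, "rb") as f:
        data = f.read()
    return [data[off:off + length] for off, length in index]
```

The index of (offset, length) pairs is the only metadata kept per buffer, so the file system sees one file and one large write rather than many small ones. A real checkpoint engine would also persist that index and would likely submit the write asynchronously.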
Abstract
As LLMs and foundation models scale, checkpoint/restore has become a critical pattern for training and inference. With 3D parallelism (tensor, pipeline, data), checkpointing involves many processes, each managing numerous tensors of varying shapes and sizes, that must be persisted frequently to stable storage (e.g., parallel file systems). This turns checkpoint/restore into a big-data I/O problem characterized by volume, variety, and velocity. The workflow must traverse the full storage stack -- from GPU memory through host memory and local storage to external repositories -- whose tiers differ by orders of magnitude in performance, creating bottlenecks under concurrency even with asynchronous flush/prefetch. Kernel-accelerated I/O libraries such as liburing may mitigate these issues versus POSIX, but their effectiveness for LLM checkpointing remains underexplored. We develop…
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques
