Understanding LLM Checkpoint/Restore I/O Strategies and Patterns

Mikaila J. Gossman; Avinash Maurya; Bogdan Nicolae; Jon C. Calhoun

arXiv:2512.24511 · cs.DC · January 1, 2026

Open Access

TL;DR

This paper investigates I/O strategies for checkpointing large language models, highlighting the importance of aggregation and coalescing to improve throughput and reduce bottlenecks in big-data I/O workflows.

Contribution

The paper provides a detailed analysis of I/O bottlenecks in LLM checkpointing and proposes microbenchmark-based strategies using liburing to optimize performance.

Findings

01

Uncoalesced small-buffer I/O halves throughput.

02

Aggregation and alignment restore bandwidth and reduce metadata overhead.

03

The authors' approach achieves up to 7.6x higher throughput than existing checkpoint engines.
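
The aggregation and alignment idea in the findings above can be illustrated with a minimal sketch. This is not the authors' implementation and it does not use liburing; it uses plain positional writes from Python's `os` module only to show the data layout. The helper names (`align_up`, `coalesce`, `write_coalesced`, `read_one`) and the 4096-byte alignment are assumptions chosen for illustration. Many small buffers are packed into one zero-padded blob so that each buffer starts on an aligned offset, and the blob is then written with one large request instead of one request per buffer.

```python
import os

ALIGN = 4096  # assumed block size; real devices and file systems may differ


def align_up(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def coalesce(buffers, align=ALIGN):
    """Pack small buffers into one zero-padded blob.

    Returns (blob, index), where index[i] = (offset, length) of buffers[i]
    and every offset is a multiple of align.
    """
    index = []
    offset = 0
    for buf in buffers:
        index.append((offset, len(buf)))
        offset = align_up(offset + len(buf), align)
    blob = bytearray(offset)  # zero-filled; total size is a multiple of align
    for (off, length), buf in zip(index, buffers):
        blob[off:off + length] = buf
    return bytes(blob), index


def write_coalesced(path, buffers, align=ALIGN):
    """Write all buffers as one aggregated blob; return the index."""
    blob, index = coalesce(buffers, align)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(blob):  # pwrite may write fewer bytes than asked
            written += os.pwrite(fd, blob[written:], written)
        os.fsync(fd)
    finally:
        os.close(fd)
    return index


def read_one(path, entry):
    """Read back a single buffer using its (offset, length) index entry."""
    offset, length = entry
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

The returned index plays the role of checkpoint metadata: a restore only needs the `(offset, length)` pair to fetch one tensor. The padding trades some extra bytes on disk for aligned offsets, which is the kind of trade-off the findings describe; whether it pays off on a given storage tier is an empirical question.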

Abstract

As LLMs and foundation models scale, checkpoint/restore has become a critical pattern for training and inference. With 3D parallelism (tensor, pipeline, data), checkpointing involves many processes, each managing numerous tensors of varying shapes and sizes, that must be persisted frequently to stable storage (e.g., parallel file systems). This turns checkpoint/restore into a big-data I/O problem characterized by volume, variety, and velocity. The workflow must traverse the full storage stack (from GPU memory through host memory and local storage to external repositories), whose tiers differ by orders of magnitude in performance, creating bottlenecks under concurrency even with asynchronous flush/prefetch. Kernel-accelerated I/O libraries such as liburing may mitigate these issues versus POSIX, but their effectiveness for LLM checkpointing remains underexplored. We develop…

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques