Crash-Consistent Checkpointing for AI Training on macOS/APFS
Juha Jeon

TL;DR
This paper studies checkpoint installation protocols for AI training on macOS/APFS, proposes an integrity guard that detected 99.8-100% of injected corruptions with zero false positives, and quantifies the performance cost of stronger durability guarantees.
Contribution
The paper introduces a format-agnostic integrity guard and compares three checkpoint installation modes, characterizing the trade-off between reliability and performance for AI training.
Findings
The integrity guard detects 99.8-100% of injected corruptions with zero false positives.
Performance overhead ranges from 56.5% to 570.6%, depending on the write mode.
The atomic_dirsync mode offers stronger durability at a higher performance cost.
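To make the three write modes concrete, the following is a minimal Python sketch of how they might differ. It is not the paper's implementation: the function name `install_checkpoint`, the temp-file naming, and the assumption that both atomic modes use a write-to-temp-then-rename pattern are illustrative choices made here.

```python
import os
import tempfile


def install_checkpoint(data: bytes, path: str, mode: str = "atomic_dirsync") -> None:
    """Write `data` to `path` using one of three install modes (hypothetical helper)."""
    if mode == "unsafe":
        # Baseline: overwrite in place with no fsync; a crash can leave a torn file.
        with open(path, "wb") as f:
            f.write(data)
        return
    if mode not in ("atomic_nodirsync", "atomic_dirsync"):
        raise ValueError(f"unknown mode: {mode}")

    directory = os.path.dirname(os.path.abspath(path))
    # Assumption: the temp file lives in the target directory so that the
    # rename stays on one filesystem and can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file-level durability
        os.replace(tmp_path, path)  # atomically replace the old checkpoint
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    if mode == "atomic_dirsync":
        # Also sync the directory so the rename itself is made durable.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The extra directory sync is what separates atomic_dirsync from atomic_nodirsync in this sketch, and it is one plausible source of the additional overhead reported for that mode. On macOS, fsync() does not necessarily force the drive to flush its write cache; the stronger request is fcntl() with F_FULLFSYNC. Whether the paper uses it is not stated on this page.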
Abstract
Deep learning training relies on periodic checkpoints to recover from failures, but unsafe checkpoint installation can leave corrupted files on disk. This paper presents an experimental study of checkpoint installation protocols and integrity validation for AI training on macOS/APFS. We implement three write modes with increasing durability guarantees: unsafe (baseline, no fsync), atomic_nodirsync (file-level durability via fsync()), and atomic_dirsync (file + directory durability). We design a format-agnostic integrity guard using SHA-256 checksums with automatic rollback. Through controlled experiments including crash injection (430 unsafe-mode trials) and corruption injection (1,600 atomic-mode trials), we demonstrate that the integrity guard detects 99.8-100% of corruptions with zero false positives. Performance overhead is 56.5-108.4% for atomic_nodirsync and 84.2-570.6% for…
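The integrity guard described in the abstract (SHA-256 checksums with automatic rollback) could look roughly like the sketch below. The sidecar-file layout (`<checkpoint>.sha256`) and the function names are assumptions made for illustration; the abstract does not say how the paper stores or checks its digests. Hashing the raw bytes is what makes such a guard format-agnostic: it does not need to parse the checkpoint.

```python
import hashlib
from typing import Optional, Sequence


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the raw bytes of a file in chunks, independent of its format."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_digest(path: str) -> None:
    """Record the checkpoint's digest in a hypothetical `<path>.sha256` sidecar."""
    with open(path + ".sha256", "w") as f:
        f.write(sha256_of(path))


def is_valid(path: str) -> bool:
    """Return True only if the checkpoint exists and matches its recorded digest."""
    try:
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        return sha256_of(path) == expected
    except OSError:
        # Missing checkpoint or missing sidecar counts as invalid.
        return False


def latest_valid(paths: Sequence[str]) -> Optional[str]:
    """Rollback: return the newest checkpoint that verifies, or None.

    `paths` is assumed to be ordered oldest to newest.
    """
    for path in reversed(paths):
        if is_valid(path):
            return path
    return None
```

On restart, a training loop would call `latest_valid` on its checkpoint list and resume from the returned path, skipping any newer checkpoint whose bytes no longer match the recorded digest.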
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Security and Verification in Computing
