Why Atomicity Matters to AI/ML Infrastructure: Snapshots, Firmware Updates, and the Cost of the Forward-In-Time-Only Category Mistake
Paul Borrill

TL;DR
This paper critically examines two assumptions in AI/ML systems, atomic snapshots and seamless infrastructure updates. It shows that both face fundamental limitations arising from the Forward-In-Time-Only (FITO) category mistake, and it proposes a protocol to address them.
Contribution
It formalizes the FITO category mistake, models checkpoint inconsistencies, and introduces a bilateral convergence protocol that avoids the need for atomic snapshots.
Findings
Atomicity events are measure-zero and exponentially rare.
Checkpoint inconsistencies are formalized on an epoch lattice.
Atomic deployment of firmware updates requires unattainable common knowledge.
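As a rough illustration of the first two findings (not the paper's formalism), the sketch below treats a checkpoint as a vector of per-worker epochs ordered componentwise, which forms a lattice under elementwise min and max. The names Shard, epoch_vector, meet, join, is_consistent, and all_aligned_probability are hypothetical, and the independence assumption behind p ** n is a simplification made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """One worker's contribution to a checkpoint (hypothetical model)."""
    worker: str
    epoch: int  # training step at which this worker wrote its shard


def epoch_vector(shards):
    """Per-worker epochs, ordered by worker name."""
    return tuple(s.epoch for s in sorted(shards, key=lambda s: s.worker))


def meet(u, v):
    """Greatest lower bound under the componentwise order."""
    return tuple(min(a, b) for a, b in zip(u, v))


def join(u, v):
    """Least upper bound under the componentwise order."""
    return tuple(max(a, b) for a, b in zip(u, v))


def is_consistent(vec):
    """A checkpoint is consistent here only if every worker shares one epoch."""
    return len(set(vec)) == 1


def all_aligned_probability(p, n):
    """Chance that n workers are all at the target epoch at one instant,
    ASSUMING each is aligned independently with probability p."""
    return p ** n


shards = [Shard("w0", 100), Shard("w1", 100), Shard("w2", 99)]
vec = epoch_vector(shards)
target = (100, 100, 100)

print(vec, "consistent:", is_consistent(vec))
print("meet with target:", meet(vec, target))
print("join with target:", join(vec, target))
print("P(all 1024 workers aligned | p=0.99):", all_aligned_probability(0.99, 1024))
```

Under this toy model a single straggler makes the checkpoint a mixed-epoch vector, and the chance of catching every worker aligned at one instant shrinks geometrically with cluster size, which is the intuition behind "exponentially rare".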
Abstract
Large-scale AI/ML training systems depend on two assumptions that are rarely examined: (1) that checkpoints represent atomic snapshots of global training state, and (2) that infrastructure updates can be applied without inducing mixed-protocol cluster states. Both assumptions are instances of a deeper structural error: the Forward-In-Time-Only (FITO) category mistake, which confuses protocol convergence properties with temporal predicates. We formalize this confusion as a type error: the identification of a temporal snapshot with a convergence property. We model checkpoint execution in a process-algebraic framework and prove that under asynchronous composition with crash-recovery failures, no temporal instant can serve as an atomicity boundary. We reformulate checkpoint inconsistency on an epoch lattice and show that atomicity is a…
Taxonomy
Topics: Distributed systems and fault tolerance · Software-Defined Networks and 5G · Software System Performance and Reliability
