Data Version Management and Machine-Actionable Reproducibility for HPC

Andreas Knüpfer; Timothy J. Callow

arXiv:2505.06558·cs.DC·September 29, 2025

TL;DR

This paper introduces a DataLad extension that brings data version control and machine-actionable reproducibility to HPC environments using the SLURM batch scheduler. It resolves DataLad's incompatibility with batch processing and avoids inefficient behavior patterns on parallel file systems.

Contribution

The paper presents an extension to DataLad that makes it compatible with HPC batch processing under the SLURM scheduler, so that multiple jobs can be scheduled concurrently on the same data repository.

Findings

01

Enables multiple HPC jobs to be scheduled concurrently on the same data repository.

02

Avoids inefficient behavior patterns that may emerge on parallel file systems.

03

Provides machine-actionable reproducibility of data processing in HPC settings.

Abstract

We present a solution for research data version control and machine-actionable reproducibility of data processing for High Performance Computing (HPC) environments and the SLURM batch scheduler. Both aspects are important for research data management, and the DataLad tool provides both on top of the widely used Git version control system. However, DataLad is incompatible with HPC batch processing. The presented extension makes it compatible with HPC batch processing with the SLURM scheduler. It solves the fundamental incompatibility so that multiple jobs can be scheduled concurrently on the same data repository. It also avoids inefficient behavior patterns that may emerge on parallel file systems.
