File-based localization of numerical perturbations in data analysis pipelines
Ali Salari, Gregory Kiar, Lindsay Lewis, Alan C. Evans, Tristan Glatard

TL;DR
This paper introduces Spot, a tool that identifies sources of numerical perturbations in data analysis pipelines, helping to understand and mitigate reproducibility issues caused by computational instabilities.
Contribution
The paper presents Spot, a novel system that detects numerical differences in pipelines through system-call interception, without requiring pipeline modifications.
Findings
Linear and non-linear registration cause most numerical instabilities
Spot successfully reconstructs provenance graphs for comparison
Application to Human Connectome Project pipelines confirms known instability sources
Abstract
Data analysis pipelines are known to be impacted by computational conditions, presumably due to the creation and propagation of numerical errors. While this process could play a major role in the current reproducibility crisis, the precise causes of such instabilities and the path along which they propagate in pipelines are unclear. We present Spot, a tool to identify which processes in a pipeline create numerical differences when executed in different computational conditions. Spot leverages system-call interception through ReproZip to reconstruct and compare provenance graphs without pipeline instrumentation. By applying Spot to the structural pre-processing pipelines of the Human Connectome Project, we found that linear and non-linear registration are the cause of most numerical instabilities in these pipelines, which confirms previous findings.
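The core idea of the abstract, comparing the files each process reads and writes under two computational conditions, can be illustrated with a small sketch. This is not Spot's implementation: the function names, the labels, and the rule used here (a process "creates" differences when its outputs differ across conditions while its inputs do not) are assumptions made for illustration, and the file and process names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def differing_files(sums_a, sums_b):
    """Return the names of files whose checksums differ between two conditions."""
    common = sums_a.keys() & sums_b.keys()
    return {name for name in common if sums_a[name] != sums_b[name]}


def classify_processes(processes, sums_a, sums_b):
    """Label each process by its relation to cross-condition differences.

    processes maps a process name to (input files, output files).
    The labelling rule is an assumption made for this sketch.
    """
    diff = differing_files(sums_a, sums_b)
    labels = {}
    for name, (inputs, outputs) in processes.items():
        inputs_differ = any(f in diff for f in inputs)
        outputs_differ = any(f in diff for f in outputs)
        if outputs_differ and not inputs_differ:
            labels[name] = "creates differences"
        elif outputs_differ:
            labels[name] = "propagates differences"
        else:
            labels[name] = "no differences"
    return labels


# Hypothetical two-step pipeline run under conditions A and B.
processes = {
    "registration": (["raw.nii"], ["reg.nii"]),
    "segmentation": (["reg.nii"], ["seg.nii"]),
}
sums_a = {
    "raw.nii": checksum(b"raw"),
    "reg.nii": checksum(b"reg-a"),
    "seg.nii": checksum(b"seg-a"),
}
sums_b = {
    "raw.nii": checksum(b"raw"),
    "reg.nii": checksum(b"reg-b"),
    "seg.nii": checksum(b"seg-b"),
}

for process, label in classify_processes(processes, sums_a, sums_b).items():
    print(f"{process}: {label}")
```

In this toy run the first process receives identical input in both conditions but writes different output, so it is flagged as the origin of the difference, while the second only passes it along. In a real setting the mapping from processes to files would come from a recorded provenance graph rather than a hand-written dictionary.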
