TL;DR
This paper introduces a collaborative filtering approach to predict the reproducibility of data analysis pipelines in large population studies, reducing computational effort while maintaining accuracy.
Contribution
It formulates reproducibility prediction as a collaborative filtering problem, evaluates six strategies for building the training set, and identifies one, Random File Numbers (Uniform), that predicts reproducibility accurately.
Findings
Random File Numbers (Uniform) sampling predicts reproducibility accurately.
Including file and subject biases improves prediction performance.
The method significantly speeds up reproducibility assessments with minimal accuracy loss.
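To make the role of file and subject biases concrete, here is a minimal sketch of a bias-only collaborative filtering predictor. It is not the paper's implementation: the matrix layout (files as rows, subjects as columns, 1 meaning reproducible), the function names `fit_biases` and `predict`, the regularization constant, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import numpy as np

def fit_biases(R, mask, reg=1.0, n_iters=20):
    """Fit a global mean plus per-file and per-subject biases.

    R    : (n_files, n_subjects) float matrix, 1.0 = reproducible, 0.0 = not.
    mask : boolean matrix of the same shape, True where the entry is observed
           (i.e. belongs to the training set).
    reg  : hypothetical shrinkage constant; pulls poorly observed biases to 0.
    """
    mu = R[mask].mean()
    b_file = np.zeros(R.shape[0])
    b_subj = np.zeros(R.shape[1])
    for _ in range(n_iters):
        # Update file biases from residuals on observed entries only.
        resid = (R - mu - b_subj[None, :]) * mask
        b_file = resid.sum(axis=1) / (reg + mask.sum(axis=1))
        # Update subject biases the same way.
        resid = (R - mu - b_file[:, None]) * mask
        b_subj = resid.sum(axis=0) / (reg + mask.sum(axis=0))
    return mu, b_file, b_subj

def predict(mu, b_file, b_subj, threshold=0.5):
    """Predict a 0/1 reproducibility label for every (file, subject) pair."""
    scores = mu + b_file[:, None] + b_subj[None, :]
    return (scores >= threshold).astype(int)
```

A full collaborative filtering model would typically add latent factors on top of these bias terms; the sketch keeps only the biases because they are the component the findings single out.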
Abstract
Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, due to their computational and storage requirements. We present a method to predict the computational reproducibility of data analysis pipelines in large population studies. We formulate the problem as a collaborative filtering process, with constraints on the construction of the training set. We propose six strategies to build the training set, which we evaluate on two datasets: a synthetic one modeling a population with a growing number of subject types, and a real one obtained with neuroinformatics pipelines. Results show that one sampling method, "Random File Numbers (Uniform)", predicts computational reproducibility with good accuracy. We also analyze the…
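The abstract does not spell out how the "Random File Numbers (Uniform)" training set is built, so the following is only one plausible reading of the name, offered as a sketch. It assumes that each subject's files have a fixed order and that the training set contains, for every subject, a prefix of that order whose length is drawn uniformly at random. The function name `sample_training_mask` and the prefix constraint are assumptions, not details taken from the paper.

```python
import numpy as np

def sample_training_mask(n_files, n_subjects, rng):
    """Build a boolean training mask of shape (n_files, n_subjects).

    ASSUMPTION: files are ordered, and each subject contributes its first k
    files to the training set, with k drawn uniformly from {0, ..., n_files}.
    """
    counts = rng.integers(0, n_files + 1, size=n_subjects)
    # Entry (f, s) is in the training set when file index f < counts[s].
    mask = np.arange(n_files)[:, None] < counts[None, :]
    return mask, counts

rng = np.random.default_rng(0)
mask, counts = sample_training_mask(n_files=5, n_subjects=8, rng=rng)
```

Under this reading, the entries where `mask` is False are the ones whose reproducibility would be predicted rather than computed, which is where the computational savings would come from.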
