Predicting computational reproducibility of data analysis pipelines in large population studies using collaborative filtering

Soudabeh Barghi; Lalet Scaria; Ali Salari; Tristan Glatard

arXiv:1809.10139·stat.ME·September 28, 2018

TL;DR

This paper introduces a collaborative filtering approach to predict the reproducibility of data analysis pipelines in large population studies, reducing computational effort while maintaining accuracy.

Contribution

It formulates reproducibility prediction as a collaborative filtering problem with constraints on how the training set is built, and evaluates six strategies for building that training set on two datasets (one synthetic, one from neuroinformatics pipelines).

Findings

01

Random File Numbers (Uniform) sampling predicts reproducibility accurately.

02

Including file and subject biases improves prediction performance.

03

The method significantly speeds up reproducibility assessments with minimal accuracy loss.
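Finding 02 can be illustrated with a bias-only predictor. The sketch below is not the authors' implementation: it assumes a binary subject-by-file matrix `R` (1 = the file was reproduced identically, 0 = it was not) and a boolean `mask` marking the entries that were actually computed, and it fits a global mean plus one bias per subject and one per file by regularized alternating averages. The names `fit_biases` and `predict`, the regularization constant, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_biases(R, mask, n_iters=20, reg=1.0):
    """Fit global mean, subject biases, and file biases on observed entries.

    R    : (n_subjects, n_files) array of 0/1 reproducibility labels (assumed encoding).
    mask : boolean array of the same shape, True where the label is observed.
    reg  : shrinkage term added to the observation counts (hypothetical default).
    """
    mu = R[mask].mean()
    b_s = np.zeros(R.shape[0])
    b_f = np.zeros(R.shape[1])
    for _ in range(n_iters):
        # Update subject biases from residuals left after mean and file biases.
        resid = (R - mu - b_f[None, :]) * mask
        b_s = resid.sum(axis=1) / (mask.sum(axis=1) + reg)
        # Update file biases from residuals left after mean and subject biases.
        resid = (R - mu - b_s[:, None]) * mask
        b_f = resid.sum(axis=0) / (mask.sum(axis=0) + reg)
    return mu, b_s, b_f

def predict(mu, b_s, b_f, threshold=0.5):
    """Score every (subject, file) pair and threshold the score into a 0/1 label."""
    scores = mu + b_s[:, None] + b_f[None, :]
    return (scores >= threshold).astype(int)
```

A full collaborative filtering model would typically add latent factors on top of these biases; the sketch keeps only the bias terms to isolate what they contribute.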

Abstract

Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, due to their computational and storage requirements. We present a method to predict the computational reproducibility of data analysis pipelines in large population studies. We formulate the problem as a collaborative filtering process, with constraints on the construction of the training set. We propose 6 strategies to build the training set, which we evaluate on 2 datasets: a synthetic one modeling a population with a growing number of subject types, and a real one obtained with neuroinformatics pipelines. Results show that one sampling method, "Random File Numbers (Uniform)", is able to predict computational reproducibility with good accuracy. We also analyze the…
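The abstract names the "Random File Numbers (Uniform)" strategy without defining it. One plausible reading, assumed here and not confirmed by the text above, is that each subject contributes a number of files drawn uniformly at random, and that the training-set constraint means only the first files produced for a subject can be observed. The sketch below builds such a training mask; `sample_prefix_mask` and its parameters are hypothetical names.

```python
import numpy as np

def sample_prefix_mask(n_subjects, n_files, seed=0):
    """Build a boolean training mask of shape (n_subjects, n_files).

    Assumption: for each subject, a file count is drawn uniformly from
    0..n_files, and the first `count` files (in production order) are observed.
    """
    rng = np.random.default_rng(seed)
    # Upper bound is exclusive, so n_files + 1 allows a fully observed subject.
    counts = rng.integers(0, n_files + 1, size=n_subjects)
    # Row i is True for file indices 0 .. counts[i] - 1.
    mask = np.arange(n_files)[None, :] < counts[:, None]
    return mask, counts
```

Entries where the mask is True would form the training set; the remaining entries are the ones whose reproducibility is predicted rather than computed.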

Code & Models

Repositories

big-data-lab-team/paper-reproducibility-collaborative-filtering (official)
