A Recommender System for Scientific Datasets and Analysis Pipelines
Mandana Mazaheri, Gregory Kiar, Tristan Glatard

TL;DR
This paper presents a provenance-based recommender system for scientific datasets and analysis pipelines that improves resource discoverability and compatibility prediction, outperforming expert judgment in open neuroscience data sharing.
Contribution
It introduces a collaborative filtering approach leveraging provenance records to recommend compatible datasets and pipelines, demonstrating its effectiveness with real-world neuroscience data.
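As a rough illustration of the idea, the sketch below scores an untried pipeline/dataset pair from the outcomes of earlier executions. It is a minimal neighbourhood-based collaborative filter, not the authors' implementation: the provenance records, the names `RECORDS`, `build_matrix`, `cosine`, and `predict`, and the +1/-1 encoding of success and failure are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical provenance records: (pipeline, dataset, execution succeeded?)
RECORDS = [
    ("pipeA", "ds1", True),
    ("pipeA", "ds2", True),
    ("pipeB", "ds1", False),
    ("pipeB", "ds2", False),
    ("pipeB", "ds3", True),
    ("pipeC", "ds1", True),
    ("pipeC", "ds3", False),
]


def build_matrix(records):
    """Map each dataset to {pipeline: +1.0 (success) or -1.0 (failure)}."""
    matrix = {}
    for pipeline, dataset, ok in records:
        matrix.setdefault(dataset, {})[pipeline] = 1.0 if ok else -1.0
    return matrix


def cosine(u, v):
    """Cosine similarity restricted to the pipelines run on both datasets."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[p] * v[p] for p in shared)
    norm = sqrt(sum(u[p] ** 2 for p in shared)) * sqrt(sum(v[p] ** 2 for p in shared))
    return dot / norm


def predict(matrix, pipeline, dataset):
    """Score an unseen (pipeline, dataset) pair from outcomes on similar datasets.

    Returns a value in [-1, 1]; 0.0 means there is no usable evidence.
    """
    target = matrix.get(dataset, {})
    num = den = 0.0
    for other, outcomes in matrix.items():
        if other == dataset or pipeline not in outcomes:
            continue
        sim = cosine(target, outcomes)
        num += sim * outcomes[pipeline]  # similar datasets vote with their outcome
        den += abs(sim)
    return num / den if den else 0.0


matrix = build_matrix(RECORDS)
score_c_ds2 = predict(matrix, "pipeC", "ds2")  # pair never executed
score_a_ds3 = predict(matrix, "pipeA", "ds3")  # pair never executed
```

In this toy data, `ds2` behaves like `ds1` on the pipelines they share, so `pipeC` (which succeeded on `ds1`) gets a positive score for `ds2`; `ds3` behaves oppositely to both, so `pipeA` gets a negative score for `ds3`.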
Findings
The recommender system achieves AUC=0.83, significantly better than chance.
It outperforms recommendations made by domain experts, which reach AUC=0.63.
Provenance-based recommendations capture technical interaction details often overlooked by experts.
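For readers unfamiliar with the metric, AUC (area under the ROC curve) is the probability that a randomly chosen compatible pair is scored above a randomly chosen incompatible one, so 0.5 corresponds to chance. The sketch below computes it directly from that definition; the function name `auc` and the example labels and scores are hypothetical and are not data from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of truthy (compatible) / falsy (incompatible) values.
    scores: recommender scores in the same order; higher means "more compatible".
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: two compatible and two incompatible pairs.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 3 of 4 pairs ordered correctly
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based formulations give the same value more efficiently on large evaluations.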
Abstract
Scientific datasets and analysis pipelines are increasingly being shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and pipelines, this lack of clear compatibility threatens the findability and reusability of these resources. We investigate the feasibility of a collaborative filtering system to recommend pipelines and datasets based on provenance records from previous executions. We evaluate our system using datasets and pipelines extracted from the Canadian Open Neuroscience Platform, a national initiative for open neuroscience. The recommendations provided by our system (AUC=0.83) are significantly better than chance and outperform recommendations made by domain experts using their previous knowledge as well…
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Research Data Management Practices
