Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code
Clémence Sebe, Olivier Ferret, Aurélie Névéol, Mahdi Esmailoghli, Ulf Leser, Sarah Cohen-Boulakia

TL;DR
This paper introduces CoPaLink, an automated method that links bioinformatics tools mentioned in scientific texts and workflow code to enhance reproducibility and understanding of biological data analysis workflows.
Contribution
The paper presents a novel integrated approach combining NER and entity linking to connect workflow descriptions with executable code in bioinformatics.
Findings
High individual F1-measure (84-89) for each of the three components (NER on text, NER on code, entity linking)
Joint accuracy of 66% on Nextflow workflows
Effective bridging of narrative descriptions and code implementations
Abstract
Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. The ability to clearly connect the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. This task requires the linking of Bioinformatics tools in workflow code with their mentions in a published workflow description.
Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for tool mentions in workflow code, and entity linking grounded on Bioinformatics knowledge bases. We propose approaches for all three steps achieving a high individual F1-measure (84-89) and a joint accuracy of 66 when evaluated on Nextflow workflows using…
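To make the three-step idea in the abstract concrete, here is a toy sketch of how tool mentions in a paper and in workflow code could be mapped to shared identifiers. It is not the CoPaLink implementation: a plain dictionary lookup stands in for both NER components, the tiny `KNOWLEDGE_BASE` is a hypothetical replacement for a real bioinformatics knowledge base, and all function names and sample inputs are invented for illustration.

```python
import re

# Hypothetical mini knowledge base: canonical tool ID -> known surface forms.
# Assumption: a real system would query a bioinformatics knowledge base instead.
KNOWLEDGE_BASE = {
    "fastqc": ["FastQC"],
    "bwa": ["BWA-MEM", "bwa mem", "BWA"],
    "samtools": ["SAMtools"],
}


def find_tool_ids(source):
    """Stand-in for NER plus entity linking: return the IDs of all tools
    whose aliases occur in `source` (case-insensitive, whole tokens only)."""
    found = set()
    for tool_id, aliases in KNOWLEDGE_BASE.items():
        for alias in aliases:
            # Reject matches embedded inside a longer alphanumeric token.
            pattern = r"(?<![A-Za-z0-9])" + re.escape(alias) + r"(?![A-Za-z0-9])"
            if re.search(pattern, source, flags=re.IGNORECASE):
                found.add(tool_id)
                break
    return found


def link_paper_to_code(paper_text, workflow_code):
    """Compare the tools grounded in the paper with those grounded in the code."""
    in_paper = find_tool_ids(paper_text)
    in_code = find_tool_ids(workflow_code)
    return {
        "linked": sorted(in_paper & in_code),
        "paper_only": sorted(in_paper - in_code),
        "code_only": sorted(in_code - in_paper),
    }


# Invented sample inputs: one sentence of a paper and a Nextflow-style process.
paper_text = "Reads were checked with FastQC and aligned using BWA-MEM."
workflow_code = '''
process ALIGN {
    script:
    """
    bwa mem ref.fa reads.fq | samtools sort -o out.bam
    """
}
'''

result = link_paper_to_code(paper_text, workflow_code)
print(result)
```

In this toy input, "BWA-MEM" in the paper and `bwa mem` in the code resolve to the same identifier, while tools found on only one side are reported separately. Grounding both sides in a shared identifier, rather than comparing strings directly, is what lets differently written mentions be matched.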
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Research Data Management Practices
