It's Not Just GitHub: Identifying Data and Software Sources Included in Publications
Emily Escamilla, Lamia Salsabil, Martin Klein, Jian Wu, Michele C. Weigle, Michael L. Nelson

TL;DR
This paper presents a hybrid classification method to identify open-access data and software URIs in scholarly publications, enhancing preservation and reproducibility by recognizing diverse hosting platforms beyond well-known repositories.
Contribution
It introduces a hybrid classifier that detects open-access data and software (OADS) URIs across numerous hosting platforms, including niche and less common ones, improving discoverability and preservation.
Findings
33% of OADS URIs are from Git hosting platforms (GHPs).
Nearly 50,000 unique hostnames host non-GHP OADS URIs.
Hybrid classifier improves identification of diverse hosting platforms.
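To illustrate the split between Git hosting platforms and all other hosts described in the findings above, here is a minimal sketch of hostname-based matching with a regular expression. The hostname list, the function names, and the sample URIs are assumptions made for illustration; they are not the paper's pattern set, and this sketch covers only the regular-expression side, not the hybrid classifier itself.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, illustrative list of Git hosting platform hostnames.
# This is an assumption for the sketch, not the list used in the paper.
GHP_PATTERN = re.compile(
    r"^(?:www\.)?(?:github\.com|gitlab\.com|bitbucket\.org|sourceforge\.net)$",
    re.IGNORECASE,
)


def hostname(uri: str) -> str:
    """Return the lowercase hostname of a URI, or '' if none can be parsed."""
    return (urlsplit(uri).hostname or "").lower()


def is_ghp(uri: str) -> bool:
    """True if the URI's hostname matches one of the assumed Git hosting platforms."""
    return bool(GHP_PATTERN.match(hostname(uri)))


def split_by_platform(uris):
    """Return (URIs on assumed Git hosting platforms, set of all other hostnames)."""
    ghp_uris = [u for u in uris if is_ghp(u)]
    other_hosts = {hostname(u) for u in uris if not is_ghp(u) and hostname(u)}
    return ghp_uris, other_hosts


# Made-up sample URIs, used only to exercise the functions above.
sample = [
    "https://github.com/user/repo",
    "https://zenodo.org/record/1",
    "http://data.example.edu/dataset.csv",
]
ghp_uris, other_hosts = split_by_platform(sample)
```

A fixed pattern like this only recognizes hosts that were listed in advance, which is why hosts outside the list need a different identification step.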
Abstract
Paper publications are no longer the only form of research product. Due to recent initiatives by publication venues and funding institutions, open access datasets and software products are increasingly considered research products and URIs to these products are growing more prevalent in scholarly publications. However, as with all URIs, resources found on the live Web are not permanent. Archivists and institutions including Software Heritage, Internet Archive, and Zenodo are working to preserve data and software products as valuable parts of reproducibility, a cornerstone of scientific research. While some hosting platforms are well-known and can be identified with regular expressions, there are a vast number of smaller, more niche hosting platforms utilized by researchers to host their data and software. If it is not feasible to manually identify all hosting platforms used by…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Data Quality and Management
