Identifying and extracting Data Access Statements from full-text academic articles
David Pride, Matteo Cancellieri, Petr Knoth

TL;DR
This paper presents a machine learning-based module developed by CORE to automatically identify and extract Data Access Statements from full-text academic articles, enhancing metadata quality and data sharing practices.
Contribution
The paper introduces a novel machine learning tool for automated identification and extraction of Data Access Statements (DAS), improving efficiency and standardization in scholarly metadata curation.
Findings
Reduces manual curation workload
Improves metadata accuracy and completeness
Supports compliance with funder data-sharing policies
Abstract
A Data Access Statement (DAS) is a formal declaration detailing how and where the underlying research data associated with a publication can be accessed. It promotes transparency, reproducibility, and compliance with funder and publisher data-sharing requirements. Funders such as Plan S, the European Union, UKRI, and NIH emphasise the inclusion of DAS in publications, underscoring its growing importance. Although a DAS benefits research by increasing transparency, discoverability, and data quality, clarifying access protocols, and elevating datasets to first-class research outputs, the repository community faces challenges in managing and curating DAS as a standard metadata component. Manual DAS curation remains labour-intensive and time-consuming, hindering efficient data-sharing practices. CORE has co-designed with the repository community a module that uses machine learning to…
Taxonomy
Topics: Research Data Management Practices · Scientific Computing and Data Management · Academic Publishing and Open Access
