Identifying and extracting Data Access Statements from full-text academic articles

David Pride; Matteo Cancellieri; Petr Knoth

arXiv:2512.00001·cs.DL·December 2, 2025

Open Access

TL;DR

This paper presents a machine learning-based module developed by CORE to automatically identify and extract Data Access Statements from full-text academic articles, enhancing metadata quality and data sharing practices.

Contribution

The paper introduces a novel machine learning tool for automated DAS identification and extraction, improving efficiency and standardization in scholarly metadata curation.

Findings

01 Reduces manual curation workload

02 Improves metadata accuracy and completeness

03 Supports compliance with funder data-sharing policies

Abstract

A Data Access Statement (DAS) is a formal declaration detailing how and where the underlying research data associated with a publication can be accessed. It promotes transparency, reproducibility, and compliance with funder and publisher data-sharing requirements. Funders such as Plan S, the European Union, UKRI, and NIH emphasise the inclusion of a DAS in publications, underscoring its growing importance. A DAS benefits research by increasing transparency, discoverability, and data quality, clarifying access protocols, and elevating datasets to first-class research outputs. Even so, the repository community faces challenges in managing and curating DAS as a standard metadata component. Manual DAS curation remains labour-intensive and time-consuming, hindering efficient data-sharing practices. CORE has co-designed with the repository community a module that uses machine learning to…

Taxonomy

Topics: Research Data Management Practices · Scientific Computing and Data Management · Academic Publishing and Open Access