Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results
Lauren Cadwallader, Iain Hrynaszkiewicz, parth sarin, Tim Vines

TL;DR
This paper introduces a new LLM-based indicator to measure research data reuse in scholarly publications, revealing a 43% reuse rate and demonstrating the potential of generative AI for large-scale impact assessment.
Contribution
The study develops a generative AI method to quantify research data reuse, providing preliminary results that suggest higher reuse rates than those found with traditional bibliometric techniques.
Findings
Data reuse rate of 43% exceeds previous estimates.
LLMs can effectively measure data reuse at scale.
Research data sharing's positive impact may be underestimated.
Abstract
Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices, although it is more important to understand the 'downstream' effects or impacts of open science. PLOS and DataSeer have developed a new LLM-based indicator to measure an important effect of open science: the reuse of research data. Our results show a data reuse rate of 43%, which is higher than the rates obtained with established bibliometric techniques. We show that data reuse can be measured at scale using LLMs and generative artificial intelligence. The positive effects of research data sharing and reuse may currently be underestimated.
