# PreprintToPaper dataset: connecting bioRxiv preprints with journal publications

**Authors:** Fidan Badalova, Julian Sienkiewicz, Philipp Mayr

PMC · DOI: 10.1038/s41597-026-06867-3 · Scientific Data · 2026-02-24

## TL;DR

The PreprintToPaper dataset links bioRxiv preprints to their journal publications, helping researchers study how scientific papers evolve from preprint to final publication.

## Contribution

The dataset enables large-scale analysis of preprint-to-journal publication dynamics across two separated periods, 2016–2018 (pre-pandemic) and 2020–2022 (pandemic), and includes a human-annotated subset of 299 records for evaluating Gray Zone cases.

## Key findings

- The dataset includes metadata for 145,517 preprints from 2016–2018 and 2020–2022.
- A version-history subset allows analysis of preprint evolution over time.
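- Title and author similarity scores, together with a human-annotated subset of 299 records, are used to evaluate Gray Zone cases (preprints potentially published in a journal but not formally linked).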

## Abstract

The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016–2018 (pre-pandemic) and 2020–2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. We selected the two periods to capture preprint-publication dynamics before and during the COVID-19 pandemic while avoiding transitional years. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. In addition to the main dataset, a version-history subset provides all available versions of preprints within the two selected periods, enabling analysis of how preprints evolve over time. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (posted on a preprint server), and Gray Zone (potentially published in a journal but unlinked). To enhance reliability, title and author similarity scores were computed, and a human-annotated subset of 299 records was created to evaluate Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their corresponding journal articles.
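
To make the similarity scoring and three-way grouping described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' method: this summary does not say which similarity measure the paper uses, so the sketch substitutes `difflib.SequenceMatcher` for titles and a Jaccard overlap of surnames for authors. The function names, the `threshold` value of 0.8, and the triage rule in `categorize` are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace non-alphanumeric ASCII with spaces, collapse whitespace.

    Assumption: ASCII-only normalization; accented characters are dropped.
    """
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def title_similarity(preprint_title, journal_title):
    """Character-level similarity ratio in [0, 1] between normalized titles."""
    return SequenceMatcher(
        None, normalize(preprint_title), normalize(journal_title)
    ).ratio()


def author_similarity(preprint_authors, journal_authors):
    """Jaccard overlap of surname sets.

    Assumption: each name is a string whose last token is the surname.
    """
    def surnames(names):
        return {normalize(n).split()[-1] for n in names if normalize(n)}

    a, b = surnames(preprint_authors), surnames(journal_authors)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(linked_doi, title_sim=0.0, author_sim=0.0, threshold=0.8):
    """Hypothetical triage into the three groups named in the abstract.

    - Published: a formal link to a journal article exists.
    - Gray Zone: no link, but a candidate article matches closely.
    - Preprint Only: no link and no close candidate.
    The threshold is an illustrative value, not taken from the paper.
    """
    if linked_doi:
        return "Published"
    if title_sim >= threshold and author_sim >= threshold:
        return "Gray Zone"
    return "Preprint Only"
```

For example, `categorize(None, title_sim=0.9, author_sim=1.0)` returns `"Gray Zone"` under these assumptions. In practice, cases near the threshold are the ones a human-annotated subset would help adjudicate.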

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12936208/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12936208/full.md

## References

9 references — full list in the complete paper: https://tomesphere.com/paper/PMC12936208/full.md

---
Source: https://tomesphere.com/paper/PMC12936208