Biomedical Data Manifest: A lightweight data documentation mapping to increase transparency for AI/ML

Daniel Bottomly; Christopher G. Suciu; Benjamin Cordier; Nathaniel Evans; Alfonso Poire; Christina Zheng; Jeffrey Myers; Vlad Sandulache; Trever Bivona; Jack Roth; Boyi Gan; Albert Koong; Pankaj Singh; Michael Hollingsworth; Jixin Dong; Brian Druker; David W. Goodrich; Song Liu; Tao Liu; Christopher Willey; Joshi Alumkal; Keith Syson Chan; Phuoc Tran; Chunru Lin; Erina Vlashi; Alice Soragni; Paul C. Boutros; Erik Knudsen; Agnieszka Witkiewicz; Xingxing Zang; Michael Deininger; Jeffrey W. Tyner; Alan Hutson; Shannon K. McWeeney

PMC · DOI: 10.1038/s41597-026-06670-0 · February 11, 2026

Open Access

TL;DR

The paper introduces a new data documentation framework called the Biomedical Data Manifest to improve transparency and reduce bias in biomedical ML datasets.

Contribution

The novel contribution is a modular, role-specific documentation template that reduces the burden on data generators while ensuring that end-users receive the information relevant to them.

Findings

01. A two-step process identified key documentation fields and role-specific priorities among biomedical stakeholders.

02. The Biomedical Data Manifest was developed to provide modular and transparent dataset documentation.

03. The framework supports transparency and bias mitigation in datasets used for ML applications.

Abstract

Biomedical machine learning (ML) models raise critical concerns about embedded assumptions influencing clinical decision-making, necessitating robust documentation frameworks for datasets that are shared via external repositories. Fairness-aware algorithm effectiveness hinges on users’ prior awareness of specific issues in the data – information such as data collection methodology, provenance and quality. Current ML-focused documentation approaches impose impractical burdens on data generators and conflate data/model accountability. This is problematic for resource datasets not explicitly created for ML applications. This study addresses these gaps through a two-step process: First, we derived consensus documentation fields by mapping elements across four key templates. Second, we surveyed biomedical stakeholders across four roles (clinicians, bench scientists, data manager and…
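The two-step process described in the abstract (mapping elements across templates, then weighting fields by stakeholder role) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the template names, field labels, canonical field names, priority scores and the `min_templates` threshold are hypothetical and are not taken from the paper, which may derive consensus and role priorities differently.

```python
from collections import defaultdict

# Hypothetical templates: each maps its own field label to a canonical
# field name. None of these labels come from the paper.
TEMPLATE_FIELDS = {
    "template_a": {"collection method": "collection_methodology",
                   "source": "provenance",
                   "license": "license"},
    "template_b": {"how collected": "collection_methodology",
                   "origin": "provenance",
                   "quality checks": "quality"},
    "template_c": {"data collection": "collection_methodology",
                   "qc": "quality"},
    "template_d": {"acquisition": "collection_methodology",
                   "lineage": "provenance",
                   "quality": "quality"},
}


def consensus_fields(template_fields, min_templates):
    """Return canonical fields present in at least `min_templates` templates."""
    seen = defaultdict(set)
    for template, mapping in template_fields.items():
        for canonical in mapping.values():
            seen[canonical].add(template)
    return sorted(f for f, ts in seen.items() if len(ts) >= min_templates)


def role_view(fields, priorities, role):
    """Order fields by a role's priority score, highest first (ties by name)."""
    scores = priorities.get(role, {})
    return sorted(fields, key=lambda f: (-scores.get(f, 0), f))


# Hypothetical survey-derived priorities for one role.
PRIORITIES = {"clinician": {"quality": 5, "provenance": 3}}

fields = consensus_fields(TEMPLATE_FIELDS, min_templates=3)
clinician_fields = role_view(fields, PRIORITIES, "clinician")
```

In this sketch a field counts as consensus when it maps to the same canonical name in at least three of the four hypothetical templates, and each role sees the same consensus fields in an order set by its own priorities.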

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species (1)

Homo sapiens (human · species)

Chemicals (2)

Novartis, Cepheid

Diseases (5)

DM, AI, Cancer, ML, leukemia

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Scientific Computing and Data Management