# Biomedical Data Manifest: A lightweight data documentation mapping to increase transparency for AI/ML

**Authors:** Daniel Bottomly, Christopher G. Suciu, Benjamin Cordier, Nathaniel Evans, Alfonso Poire, Christina Zheng, Jeffrey Myers, Vlad Sandulache, Trever Bivona, Jack Roth, Boyi Gan, Albert Koong, Pankaj Singh, Michael Hollingsworth, Jixin Dong, Brian Druker, David W. Goodrich, Song Liu, Tao Liu, Christopher Willey, Joshi Alumkal, Keith Syson Chan, Phuoc Tran, Chunru Lin, Erina Vlashi, Alice Soragni, Paul C. Boutros, Erik Knudsen, Agnieszka Witkiewicz, Xingxing Zang, Michael Deininger, Jeffrey W. Tyner, Alan Hutson, Shannon K. McWeeney

PMC · DOI: 10.1038/s41597-026-06670-0 · 2026-02-11

## TL;DR

The paper introduces the Biomedical Data Manifest, a data documentation framework intended to improve transparency and support bias mitigation for biomedical datasets used in ML.

## Contribution

The contribution is a modular, role-specific documentation template that reduces the burden on data generators while ensuring end-users receive information relevant to their role.

## Key findings

- A two-step process identified key documentation fields and role-specific priorities among biomedical stakeholders.
- The Biomedical Data Manifest was developed to provide modular and transparent dataset documentation.
- The framework supports transparency and bias mitigation in datasets used for ML applications.
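
The first step, deriving consensus fields by mapping elements across templates, can be illustrated with a small crosswalk. This is a minimal sketch under stated assumptions, not the authors' procedure: the template names, element names, canonical field names, and the `min_templates` threshold are all hypothetical placeholders, and "consensus" is assumed here to mean a field covered by at least two templates.

```python
from collections import defaultdict

# Hypothetical crosswalk: (template, element) -> canonical field name.
# None of these names come from the paper; they only illustrate the mapping idea.
CROSSWALK = {
    ("template_a", "collection process"): "data_collection_method",
    ("template_b", "how was data gathered"): "data_collection_method",
    ("template_c", "acquisition"): "data_collection_method",
    ("template_a", "source"): "provenance",
    ("template_d", "origin"): "provenance",
    ("template_b", "known limitations"): "quality_notes",
}


def consensus_fields(crosswalk, min_templates=2):
    """Return canonical fields that appear in at least `min_templates` templates.

    The threshold is an assumption made for this sketch.
    """
    templates_per_field = defaultdict(set)
    for (template, _element), canonical in crosswalk.items():
        templates_per_field[canonical].add(template)
    return sorted(
        name
        for name, templates in templates_per_field.items()
        if len(templates) >= min_templates
    )


fields = consensus_fields(CROSSWALK)
print(fields)
```

Raising `min_templates` tightens the definition of consensus; lowering it to 1 keeps every mapped field.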

## Abstract

Biomedical machine learning (ML) models raise critical concerns about embedded assumptions influencing clinical decision-making, necessitating robust documentation frameworks for datasets that are shared via external repositories. Fairness-aware algorithm effectiveness hinges on users’ prior awareness of specific issues in the data – information such as data collection methodology, provenance and quality. Current ML-focused documentation approaches impose impractical burdens on data generators and conflate data/model accountability. This is problematic for resource datasets not explicitly created for ML applications. This study addresses these gaps through a two-step process: First, we derived consensus documentation fields by mapping elements across four key templates. Second, we surveyed biomedical stakeholders across four roles (clinicians, bench scientists, data managers and computationalists) to assess field importance and relevance. This revealed important role-dependent prioritization differences, motivating the development of the Biomedical Data Manifest – a modular template employing persona-specific field presentation reducing generator burden while ensuring end-users receive role-relevant information. The Biomedical Data Manifest improves transparency for datasets deposited in public or controlled-access repositories and bias mitigation across ML applications.
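
The persona-specific field presentation described in the abstract can be pictured as a single manifest filtered by reader role. The sketch below is an illustration only and does not reproduce the published template: the four role names follow the abstract, while the class names, field names, example values, and role assignments are hypothetical.

```python
from dataclasses import dataclass, field

# The four stakeholder roles named in the abstract.
ROLES = ("clinician", "bench_scientist", "data_manager", "computationalist")


@dataclass
class ManifestField:
    name: str
    value: str
    relevant_to: frozenset  # roles that see this field in their view (assumed)


@dataclass
class Manifest:
    dataset_id: str
    fields: list = field(default_factory=list)

    def view_for(self, role):
        """Return only the fields tagged as relevant to `role`."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        return {f.name: f.value for f in self.fields if role in f.relevant_to}


# Hypothetical example content; the generator fills each field once.
manifest = Manifest(
    dataset_id="example-dataset",
    fields=[
        ManifestField("provenance", "single-site cohort", frozenset(ROLES)),
        ManifestField(
            "sample_processing",
            "frozen tissue",
            frozenset({"bench_scientist", "data_manager"}),
        ),
        ManifestField(
            "missingness",
            "documented per column",
            frozenset({"computationalist", "data_manager"}),
        ),
    ],
)

print(manifest.view_for("clinician"))
print(manifest.view_for("computationalist"))
```

Because each field is recorded once and filtered at read time, the data generator documents the dataset a single time while each role sees a shorter, targeted view.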

## Full-text entities

- **Diseases:** Cancer (MESH:D009369), leukemia (MESH:D007938)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13002863/full.md

---
Source: https://tomesphere.com/paper/PMC13002863