Biomedical Data Manifest: A lightweight data documentation mapping to increase transparency for AI/ML
Daniel Bottomly, Christopher G. Suciu, Benjamin Cordier, Nathaniel Evans, Alfonso Poire, Christina Zheng, Jeffrey Myers, Vlad Sandulache, Trever Bivona, Jack Roth, Boyi Gan, Albert Koong, Pankaj Singh, Michael Hollingsworth, Jixin Dong, Brian Druker

TL;DR
The paper introduces a new data documentation framework called the Biomedical Data Manifest to improve transparency and reduce bias in biomedical ML datasets.
Contribution
The contribution is a modular, role-specific documentation template that reduces the burden on data generators while still giving end-users the information relevant to them.
Findings
A two-step process identified key documentation fields and role-specific priorities among biomedical stakeholders.
The Biomedical Data Manifest was developed to provide modular and transparent dataset documentation.
The framework supports transparency and bias mitigation in datasets used for ML applications.
Abstract
Biomedical machine learning (ML) models raise critical concerns about embedded assumptions influencing clinical decision-making, necessitating robust documentation frameworks for datasets that are shared via external repositories. Fairness-aware algorithm effectiveness hinges on users’ prior awareness of specific issues in the data – information such as data collection methodology, provenance and quality. Current ML-focused documentation approaches impose impractical burdens on data generators and conflate data/model accountability. This is problematic for resource datasets not explicitly created for ML applications. This study addresses these gaps through a two-step process: First, we derived consensus documentation fields by mapping elements across four key templates. Second, we surveyed biomedical stakeholders across four roles (clinicians, bench scientists, data manager and…
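The first step described in the abstract, deriving consensus fields by mapping elements across templates, and the role-specific prioritization can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the template names, field names, the `min_templates` threshold, and the priority scores are all hypothetical placeholders, since the source does not name the four templates, state how consensus was defined, or report survey values.

```python
from collections import Counter

# Hypothetical templates and field names, for illustration only.
TEMPLATES = {
    "template_a": {"collection_method", "provenance", "quality_checks", "license"},
    "template_b": {"collection_method", "provenance", "intended_use"},
    "template_c": {"collection_method", "provenance", "quality_checks"},
    "template_d": {"collection_method", "quality_checks", "funding"},
}

# Made-up priority scores per role; these are NOT survey results.
PRIORITIES = {
    "clinician": {"quality_checks": 5, "provenance": 3},
    "data_manager": {"provenance": 5, "collection_method": 4},
}


def consensus_fields(templates, min_templates):
    """Return the fields found in at least `min_templates` templates, sorted by name.

    The threshold is an assumption; the source does not define consensus.
    """
    counts = Counter(field for fields in templates.values() for field in fields)
    return sorted(f for f, n in counts.items() if n >= min_templates)


def role_view(fields, priorities, role):
    """Order fields for one role: highest priority first, ties broken by name."""
    ranked = priorities.get(role, {})
    return sorted(fields, key=lambda f: (-ranked.get(f, 0), f))


shared = consensus_fields(TEMPLATES, min_templates=3)
clinician_fields = role_view(shared, PRIORITIES, "clinician")
```

Keeping the shared field list separate from the per-role ordering mirrors the modular idea in the summary: the data generator fills in one set of fields, and each role sees the same fields ranked by what matters most to it.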
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Scientific Computing and Data Management
