Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning
Negin Baghbanzadeh, Mohammed Saidul Islam, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour

TL;DR
This paper introduces Open-PMC-18M, a large-scale, high-fidelity biomedical dataset with 18 million image-text pairs, enhancing multimodal medical representation learning through advanced data curation and extensive evaluation.
Contribution
The paper presents a novel data curation pipeline for biomedical images, creating the largest high-quality dataset for medical multimodal learning, and demonstrates its effectiveness with state-of-the-art models.
Findings
Models trained on Open-PMC-18M achieve new state-of-the-art results.
The dataset improves performance across multiple medical imaging modalities.
The proposed curation process enhances data quality and generalizability.
Abstract
In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and oftern partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
