Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Negin Baghbanzadeh; Mohammed Saidul Islam; Sajad Ashkezari; Elham Dolatabadi; Arash Afkanpour

arXiv:2506.02738·cs.CV·December 8, 2025

TL;DR

This paper introduces Open-PMC-18M, a large-scale, high-fidelity biomedical dataset with 18 million image-text pairs, enhancing multimodal medical representation learning through advanced data curation and extensive evaluation.

Contribution

The paper presents a novel data curation pipeline for biomedical images, creating the largest high-quality dataset for medical multimodal learning, and demonstrates its effectiveness with state-of-the-art models.

Findings

1. Models trained on Open-PMC-18M achieve new state-of-the-art results.

2. The dataset improves performance across multiple medical imaging modalities.

3. The proposed curation process enhances data quality and generalizability.

Abstract

In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often only partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity…
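
The contextual enrichment step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the regular expression, the naive sentence splitter, and the sample paragraphs are all assumptions made for illustration. It only shows the general idea of collecting body-text sentences that cite a figure inline and attaching them to a subcaption.

```python
import re


def find_inline_references(paragraphs, figure_number):
    """Return body-text sentences that mention the given figure number.

    Hypothetical helper: matches "Fig. 2", "Figure 2", "Figs. 2", "Figure 2A",
    but not "Figure 21" when figure_number is 2.
    """
    pattern = re.compile(
        rf"\bfig(?:ure)?s?\.?\s*{figure_number}(?!\d)", re.IGNORECASE
    )
    hits = []
    for paragraph in paragraphs:
        # Naive sentence split: end punctuation, whitespace, then a capital letter.
        # "Fig. 2" is not split because a digit follows the period.
        for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph):
            if pattern.search(sentence):
                hits.append(sentence.strip())
    return hits


def enrich_caption(subcaption, paragraphs, figure_number):
    """Pair a subcaption with the inline-reference sentences found for its figure."""
    context = find_inline_references(paragraphs, figure_number)
    return {"subcaption": subcaption, "context": " ".join(context)}


# Made-up article text, used only to exercise the sketch.
sample_paragraphs = [
    "Patients underwent chest CT at baseline. As shown in Fig. 2A, "
    "ground-glass opacities were present in both lower lobes.",
    "Histology results are summarized in Figure 21. No further imaging was performed.",
]

pair = enrich_caption("Axial chest CT.", sample_paragraphs, 2)
print(pair["context"])
```

A real pipeline would need a more robust sentence splitter and would have to handle figure ranges, panel labels, and supplementary figures, none of which this sketch attempts.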

Code & Models

Models

vector-institute/pmc-18m-dab-detr (model, 41 downloads)

Datasets

vector-institute/open-pmc-18m (dataset, 4.6k downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning