BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang; Yanbo Xu; Naoto Usuyama; Hanwen Xu; Jaspreet Bagga; Robert Tinn; Sam Preston; Rajesh Rao; Mu Wei; Naveen Valluri; Cliff Wong; Andrea Tupini; Yu Wang; Matt Mazzola; Swadheen Shukla; Lars Liden; Jianfeng Gao; Angela Crabtree; Brian Piening; Carlo Bifulco; Matthew P. Lungren; Tristan Naumann; Sheng Wang; and Hoifung Poon

arXiv:2303.00915·cs.CV·January 10, 2025·97 cites

Open Access · 5 Repos · 7 Models · 2 Datasets

TL;DR

BiomedCLIP is a large-scale multimodal biomedical foundation model pretrained on 15 million image-text pairs, achieving state-of-the-art results across diverse biomedical vision-language tasks and outperforming specialized models.

Contribution

The paper introduces PMC-15M, a massive biomedical multimodal dataset, and presents BiomedCLIP, a pretrained model that advances biomedical vision-language understanding.

Findings

01. Achieved state-of-the-art results on multiple biomedical datasets.

02. Outperformed radiology-specific models in radiology tasks.

03. Demonstrated the effectiveness of large-scale multimodal pretraining.

Abstract

Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI

Methods: Contrastive Language-Image Pre-training
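
For readers unfamiliar with the method named above, the following is a minimal sketch of the generic symmetric contrastive objective used in CLIP-style pretraining. It is not taken from the BiomedCLIP code and may differ from the authors' training setup. The function name `clip_contrastive_loss`, the NumPy implementation, and the temperature value of 0.07 are illustrative assumptions. Row i of each input is assumed to hold the embedding of the i-th image and of its paired caption.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    Hypothetical helper for illustration only. `temperature` is an assumed
    value, not one reported on this page.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = scaled similarity between image i and text j.
    logits = image_emb @ text_emb.T / temperature

    def diagonal_cross_entropy(scores):
        # Row-wise log-softmax with the usual max subtraction for stability.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i is column i (the paired item).
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits.T)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    # Captions that embed close to their images should give a small loss.
    aligned_texts = images + 0.01 * rng.normal(size=(8, 16))
    # Shifting the captions by one row breaks every pairing.
    mismatched_texts = np.roll(aligned_texts, shift=1, axis=0)
    print("aligned loss:   ", clip_contrastive_loss(images, aligned_texts))
    print("mismatched loss:", clip_contrastive_loss(images, mismatched_texts))
```

The loss is low when each image is most similar to its own caption and high when the pairings are scrambled, which is the signal that pulls matched image-text pairs together in the shared embedding space.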