BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco

TL;DR
BiomedCLIP is a large-scale multimodal biomedical foundation model pretrained on 15 million image-text pairs, achieving state-of-the-art results across diverse biomedical vision-language tasks and outperforming specialized models.
Contribution
The paper introduces PMC-15M, a dataset of 15 million biomedical image-text pairs collected from 4.4 million scientific articles, and presents BiomedCLIP, a multimodal foundation model pretrained on it for biomedical vision-language processing.
Findings
- Achieved state-of-the-art results on multiple biomedical datasets.
- Outperformed radiology-specific models on radiology tasks.
- Demonstrated the effectiveness of large-scale multimodal pretraining.
Abstract
Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard…
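The model name and the "Contrastive Language-Image Pre-training" method tag point to a CLIP-style objective: matching image and caption embeddings are pulled together while mismatched pairs within a batch are pushed apart. The following is a minimal pure-Python sketch of a symmetric contrastive (InfoNCE) loss, given only as an illustration. It is not the authors' implementation; the function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same
    image-text pair. temperature=0.07 is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two pairs: aligned pairs should score a lower loss than swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a loss near zero, while swapping the captions yields a much larger one, which is the behaviour the objective is meant to reward during pretraining.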
Code & Models
- 🤗 naotous/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224_original (model · 6 downloads)
- 🤗 ikim-uk-essen/BiomedCLIP_ViT_patch16_224 (model · 16 downloads · 3 likes)
- 🤗 pfytas/biomedclip_custom (model · 3 downloads)
- 🤗 ZiyueWang/biomedclip (model · 26 downloads)
- 🤗 microsoft/llava-rad (model · 935 downloads · 20 likes)
- 🤗 razaimam45/RobustMedCLIP (model)
- 🤗 X-iZhang/libra-llava-rad (model · 98 downloads · 2 likes)
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Contrastive Language-Image Pre-training
