TL;DR
UniMed-CLIP introduces a large-scale, multi-modal medical dataset and a unified vision-language model trained on diverse medical imaging modalities, significantly improving zero-shot performance and generalization across medical tasks.
Contribution
The paper presents UniMed, a comprehensive open-source dataset and a unified VLM for multiple medical imaging modalities, enabling scalable pretraining and better cross-modality generalization.
Findings
UniMed-CLIP outperforms existing generalist VLMs on medical tasks.
Achieves a +12.61 absolute gain over BiomedCLIP in zero-shot evaluations, averaged over 21 datasets.
Uses 3x less training data than BiomedCLIP, which is trained on proprietary data.
Abstract
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs either train on closed-source proprietary or relatively small open-source datasets that do not generalize well. Similarly, most models remain specific to a single or limited number of medical imaging domains, again restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is developed using a data-collection framework that leverages Large Language Models (LLMs) to transform…
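The contrastive training the abstract refers to can be illustrated with a minimal sketch of a CLIP-style symmetric loss. This is a generic illustration, not the paper's implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions chosen for the example. Each image embedding is pulled toward its paired text embedding and pushed away from the other texts in the batch, and the same is done in the text-to-image direction.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    Pair i is (image_embs[i], text_embs[i]); the temperature is an
    illustrative value, not one taken from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text: row i should assign the highest score to column i.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: column j should assign the highest score to row j.
    t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: correctly paired embeddings versus the same embeddings mispaired.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a much lower loss than the mispaired ones, which is the signal that drives image and text encoders toward a shared embedding space.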
Taxonomy
Methods: Contrastive Learning
