Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

Shansong Wang; Zhecheng Jin; Mingzhe Hu; Mojtaba Safari; Feng Zhao; Chih-Wei Chang; Richard LJ Qiu; Justin Roper; David S. Yu; Xiaofeng Yang

arXiv:2506.22567 · cs.CV · July 1, 2025

TL;DR

This paper introduces MMKD-CLIP, a generalist biomedical foundation model trained via multi-teacher knowledge distillation from nine existing biomedical CLIP models, enabling robust performance across diverse biomedical tasks without relying on billion-scale raw datasets.

Contribution

The paper presents a novel two-stage training pipeline for biomedical foundation models based on multi-CLIP knowledge distillation, designed to address the scarcity and heterogeneity of biomedical image-text data.

Findings

01. Outperforms individual teacher models across tasks

02. Demonstrates robustness across 58 biomedical datasets

03. Effective without billion-scale raw data

Abstract

CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations impede the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome these limitations, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs.…
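
The abstract above is truncated and does not spell out the distillation objective, so the following is only a generic sketch of multi-teacher knowledge distillation, not MMKD-CLIP's actual loss. It assumes each teacher and the student produce image-to-text similarity scores for the same candidate captions, and it averages a temperature-softened KL divergence over the teachers. All function names, the temperature value, and the example scores are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_distillation_loss(student_scores, teacher_scores_list,
                                    temperature=2.0):
    """Average KL(teacher || student) over all teachers for one image.

    Assumption: uniform weighting of teachers. The paper may weight or
    select teachers differently; the source excerpt does not say.
    """
    student_probs = softmax(student_scores, temperature)
    losses = [
        kl_divergence(softmax(teacher_scores, temperature), student_probs)
        for teacher_scores in teacher_scores_list
    ]
    return sum(losses) / len(losses)


# Hypothetical similarity scores of one image against three captions,
# produced by three teachers and one student.
teachers = [[4.0, 1.0, 0.5], [3.5, 1.5, 0.2], [4.2, 0.8, 0.9]]
student = [2.0, 1.8, 1.0]
loss = multi_teacher_distillation_loss(student, teachers)
print(f"distillation loss: {loss:.4f}")
```

Minimizing a loss of this kind pulls the student's caption ranking toward the teachers' consensus, which is the general idea behind learning from pretrained models instead of from billion-scale raw data.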

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling