MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training

Lei Zhu; Jun Zhou; Rick Siow Mong Goh; Yong Liu

arXiv:2507.17239 · cs.CV · July 24, 2025

TL;DR

MaskedCLIP is a semi-supervised vision-language pre-training framework that learns from both paired image-text data and unpaired medical images. It uses a bridge transformer and masked knowledge distillation to improve feature generalization for medical image analysis.

Contribution

The paper presents MaskedCLIP, a novel framework that effectively integrates paired and unpaired image data for foundation model learning in medical imaging.

Findings

1. Improves downstream task performance in retinal image analysis.
2. Enhances data efficiency in medical vision-language pre-training.
3. Effectively bridges feature spaces from different data types.

Abstract

Foundation models have recently gained tremendous popularity in medical image analysis. State-of-the-art methods leverage either paired image-text data via vision-language pre-training or unpaired image data via self-supervised pre-training to learn foundation models with generalizable image features to boost downstream task performance. However, learning foundation models exclusively on either paired or unpaired image data limits their ability to learn richer and more comprehensive image features. In this paper, we investigate a novel task termed semi-supervised vision-language pre-training, aiming to fully harness the potential of both paired and unpaired image data for foundation model learning. To this end, we propose MaskedCLIP, a synergistic masked image modeling and contrastive language-image pre-training framework for semi-supervised vision-language pre-training. The key…
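The abstract names two generic ingredients, masked image modeling and contrastive language-image pre-training. As a rough illustration only, the sketch below shows a textbook version of each in NumPy: random patch masking for unpaired images and a symmetric contrastive (InfoNCE) loss for paired image-text embeddings. It is not the paper's implementation. The function names, the 0.75 mask ratio, and the 0.07 temperature are assumptions chosen for the example, and the bridge transformer and masked knowledge distillation are not modeled here.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked patch (hypothetical helper)."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    # Pick distinct patch indices to hide from the encoder.
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])
    # Image-to-text: softmax over texts for each image (rows).
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: softmax over images for each text (columns).
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Unpaired branch: hide an assumed 75% of 196 patches.
    mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)
    print("masked patches:", int(mask.sum()))
    # Paired branch: random stand-ins for encoder outputs.
    images = rng.normal(size=(8, 32))
    texts = rng.normal(size=(8, 32))
    print("contrastive loss:", float(clip_contrastive_loss(images, texts)))
```

In a semi-supervised setup of the kind the abstract describes, the masking step would apply to images without text and the contrastive loss to image-text pairs. How the two resulting feature spaces are connected is the part the paper itself addresses.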

Taxonomy

Topics: Biomedical Text Mining and Ontologies