MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
Halil Ibrahim Gulluk, Olivier Gevaert

TL;DR
This paper introduces MAM-CLIP, a vision-language model trained on mammography images and captions to improve BI-RADS classification, especially with limited labeled data.
Contribution
It pretrains a vision encoder with contrastive learning on image-caption pairs from mammography atlases, then fine-tunes it for BI-RADS classification, improving on models trained on image labels alone.
Findings
3-class F1 score improved by up to 14% with fewer labeled samples
2K image-text pairs can outperform 2K labeled samples in training
Pretrained model achieves superior BI-RADS prediction accuracy
Abstract
Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior…
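To make the contrastive pretraining step concrete, here is a minimal NumPy sketch of a symmetric image-text contrastive loss of the kind popularized by CLIP. The abstract says only that the model is trained "on image-text pairs with contrastive learning", so the exact loss form, the temperature value, and every name below (`symmetric_contrastive_loss`, `image_emb`, `text_emb`) are illustrative assumptions, not the authors' implementation. Random vectors stand in for the outputs of the vision encoder and of PubMedBERT.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Assumed CLIP-style loss: row i of image_emb and row i of text_emb
    # form a matching image-caption pair; all other rows are negatives.
    # temperature=0.07 is an assumed value, not taken from the paper.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick its own caption (softmax over columns).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick its own image (softmax over rows).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy embeddings standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
near_text_emb = image_emb + 0.01 * rng.normal(size=(4, 8))  # captions close to their images
aligned_loss = symmetric_contrastive_loss(image_emb, near_text_emb)
shuffled_loss = symmetric_contrastive_loss(image_emb, image_emb[::-1])  # every pair mismatched
print(aligned_loss, shuffled_loss)
```

The loss is low when each image embedding lies closest to its own caption embedding and high when the pairing is scrambled. Minimizing it is what pushes caption information into the vision encoder before that encoder is fine-tuned for BI-RADS prediction.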
