Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models
Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba

TL;DR
This paper presents a multi-modal vision-language framework for breast cancer detection that combines mammogram images with clinical text to improve diagnostic accuracy and clinical practicality.
Contribution
The paper introduces a multi-modal approach that integrates visual and textual data and outperforms vision transformers on clinical breast cancer detection tasks.
Findings
Superior performance in cancer detection and calcification identification.
Effective fusion of imaging and clinical data enhances diagnostic accuracy.
Practical deployment potential across diverse populations.
Abstract
Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment, particularly in handling the nuanced interpretation of multi-modal data and in feasibility, because they require prior clinical history. This study introduces a novel framework that combines visual features from 2D mammograms with structured textual descriptors, derived from easily accessible clinical metadata and synthesized radiological reports, through dedicated tokenization modules. Our proposed methods demonstrate that strategic integration of convolutional neural networks (ConvNets) with language…
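To make the idea of pairing image features with tokenized clinical metadata concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the metadata fields (`age`, `laterality`, `view`), the function names, the whitespace-free regex tokenizer, and the concatenation-based fusion are all assumptions chosen for illustration, and the image feature vector is a placeholder for whatever a ConvNet backbone would produce.

```python
import re
from typing import Dict, List


def metadata_to_text(meta: Dict[str, object]) -> str:
    """Render structured metadata as a deterministic text descriptor.

    The field names are hypothetical; this summary does not list the real ones.
    """
    return ", ".join(f"{key}: {meta[key]}" for key in sorted(meta))


def tokenize(text: str) -> List[str]:
    """Split into lowercase alphanumeric tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to each token; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map text to a fixed-length list of token ids, truncating or padding."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def fuse(image_features: List[float], token_ids: List[int], vocab_size: int) -> List[float]:
    """Assumed fusion: concatenate image features with a bag-of-tokens count vector."""
    bag = [0.0] * vocab_size
    for tid in token_ids:
        if tid != 0:  # skip padding
            bag[tid] += 1.0
    return list(image_features) + bag


# Usage with made-up values.
meta = {"age": 54, "view": "CC", "laterality": "left"}
text = metadata_to_text(meta)
vocab = build_vocab([text])
token_ids = encode(text, vocab, max_len=8)
fused = fuse([0.1, 0.2], token_ids, vocab_size=len(vocab))
```

A real system would replace the bag-of-tokens vector with embeddings from a language model and learn the fusion jointly with the image encoder; the sketch only shows how metadata available at screening time can be turned into tokens without needing prior clinical history.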
