Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models

Shunjie-Fabian Zheng; Hyeonjun Lee; Thijs Kooi; Ali Diba

arXiv:2510.25051·cs.CV·October 30, 2025

TL;DR

This paper presents a multi-modal vision-language framework for breast cancer detection that combines mammogram images with clinical text to improve diagnostic accuracy and clinical practicality.

Contribution

The paper introduces a multi-modal approach that integrates visual and textual data and is reported to outperform vision transformers in clinical breast cancer detection tasks.

Findings

01. Superior performance in cancer detection and calcification identification.

02. Effective fusion of imaging and clinical data enhances diagnostic accuracy.

03. Practical deployment potential across diverse populations.

Abstract

Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment, particularly in handling the nuanced interpretation of multi-modal data and in feasibility, since they require prior clinical history. This study introduces a novel framework that combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through dedicated tokenization modules. Our proposed methods demonstrate that strategic integration of convolutional neural networks (ConvNets) with language…
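
To make the idea of pairing image features with text derived from clinical metadata more concrete, here is a minimal sketch in plain Python. It is not the authors' method: the metadata fields (`age`, `breast_density`, `view`), the whitespace tokenizer, the bag-of-tokens text vector, and fusion by simple concatenation are all assumptions chosen for illustration, and the image features are stand-in numbers rather than real ConvNet outputs.

```python
from typing import Dict, List


def metadata_to_descriptor(meta: Dict[str, object]) -> str:
    """Render clinical metadata as a short structured text descriptor."""
    # Sort keys so the same metadata always yields the same string.
    return "; ".join(f"{key}: {meta[key]}" for key in sorted(meta))


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map words to integer ids, adding unseen words to the vocabulary."""
    words = text.lower().replace(";", " ").replace(":", " ").split()
    ids = []
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids


def bag_of_tokens(ids: List[int], size: int) -> List[float]:
    """Count token occurrences in a fixed-size vector (ids >= size are dropped)."""
    vec = [0.0] * size
    for i in ids:
        if i < size:
            vec[i] += 1.0
    return vec


def fuse(image_features: List[float], text_features: List[float]) -> List[float]:
    """Hypothetical fusion step: concatenate the two feature vectors."""
    return list(image_features) + list(text_features)


# Hypothetical example record; field names are illustrative only.
meta = {"view": "MLO", "age": 52, "breast_density": "C"}
vocab: Dict[str, int] = {}
descriptor = metadata_to_descriptor(meta)
token_ids = tokenize(descriptor, vocab)
text_vec = bag_of_tokens(token_ids, size=8)
fused = fuse([0.2, 0.7], text_vec)  # [0.2, 0.7] stands in for image features
```

In a real system the concatenated vector would feed a trained classifier, and both the image encoder and the text representation would be learned rather than hand-built as they are here.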
