3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
Haowen Zhu, Ning Yin, Xiaogen Zhou

TL;DR
MedMAP is a modality-aware pretraining framework that improves vision-language alignment and cross-modal feature fusion in 3D MRI for multi-organ abnormality detection. The authors report that it outperforms existing VLMs on this task.
Contribution
Introduces MedMAP, a modality-aware pretraining approach designed for 3D MRI vision-language tasks, targeting two challenges: modality-specific vision-language alignment and cross-modal feature fusion.
Findings
MedMAP outperforms existing VLMs on 3D MRI abnormality detection.
The authors curated MedMoM-MRI3D, a dataset of 7,392 3D MRI volume-report pairs.
During pretraining, the modality-aware encoders implicitly capture the joint modality distribution.
Abstract
Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI…
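The abstract does not state which objective drives the alignment stage. As a rough illustration only, the sketch below uses a symmetric contrastive (CLIP-style InfoNCE) loss, a common choice for aligning paired visual and textual embeddings; it is not confirmed to be MedMAP's loss. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched pair.

    Hypothetical stand-in for an alignment objective, not the paper's.
    """
    v = [l2_normalize(x) for x in vision_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # Cosine similarity of every volume embedding with every report embedding.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Volume-to-report direction: the matching report sits on the diagonal.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Report-to-volume direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy 2-D embeddings: correctly paired vs. deliberately mismatched.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy setup the correctly paired batch yields a lower loss than the mismatched one, which is the behavior an alignment objective of this kind rewards. In the fine-tuning stage described above, only the vision encoders would receive gradient updates while the text encoder stays frozen.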
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Advanced Neural Network Applications
