Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training
Ke Zhang, Yan Yang, Jun Yu, Hanliang Jiang, Jianping Fan, Qingming Huang, Weidong Han

TL;DR
This paper introduces a novel medical vision-language pre-training framework that enhances cross-modal alignment and interaction through multi-task paired masking, leading to improved performance across various downstream medical imaging tasks.
Contribution
It proposes a unified Med-VLP framework with multi-task paired masking, alignment modeling, and modules for better cross-modal interaction and semantic representation in medical imaging.
Findings
Outperforms previous methods in all downstream tasks
Enhances cross-modal interaction and semantic understanding
Improves performance in uni-, cross-, and multi-modal tasks
Abstract
In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. As a solution, Medical Vision-Language Pre-training (Med-VLP) methods have been proposed to learn universal representations from medical images and reports, benefiting downstream tasks without requiring fine-grained annotations. However, existing methods have overlooked the importance of cross-modal alignment in joint image-text reconstruction, resulting in insufficient cross-modal interaction. To address this limitation, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA), which integrates the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction, while a Global and Local Alignment (GLA) module is designed to assist the self-supervised paradigm in obtaining…
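The abstract describes adding a cross-modal alignment objective to joint masked image-text reconstruction. The sketch below is only an illustration of that general idea, not the authors' implementation: the mask ratio, the mean-squared-error reconstruction term, the symmetric InfoNCE alignment term, the temperature, and the weighting are all assumptions, and every function name is hypothetical. A masked-token loss for the report text would be added in the same way as the image term.

```python
import math
import random


def random_mask(n_items, ratio, rng):
    """Return booleans with round(n_items * ratio) True entries (True = masked)."""
    n_masked = round(n_items * ratio)
    masked = set(rng.sample(range(n_items), n_masked))
    return [i in masked for i in range(n_items)]


def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only (assumes at least one is masked)."""
    errs = [(p - t) ** 2
            for p_vec, t_vec, m in zip(pred, target, mask) if m
            for p, t in zip(p_vec, t_vec)]
    return sum(errs) / len(errs)


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _row_cross_entropy(logits, target_idx):
    """Cross-entropy of one row of logits against one target index (stable log-sum-exp)."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target_idx]


def symmetric_info_nce(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive loss; item k of each list is assumed to be a matched pair."""
    imgs = [_normalize(v) for v in img_embs]
    txts = [_normalize(v) for v in txt_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    img_to_txt = sum(_row_cross_entropy(sim[k], k) for k in range(n)) / n
    txt_to_img = sum(_row_cross_entropy([sim[r][k] for r in range(n)], k)
                     for k in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def combined_loss(recon_loss, align_loss, align_weight=1.0):
    """Assumed multi-task objective: reconstruction plus weighted alignment."""
    return recon_loss + align_weight * align_loss
```

In use, `random_mask` would be called once for the image patches and once for the report tokens of the same image-report pair, so that both modalities are masked together before reconstruction and alignment are scored.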
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
