Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Ke Zhang; Yan Yang; Jun Yu; Hanliang Jiang; Jianping Fan; Qingming Huang; Weidong Han

arXiv:2305.07920 · cs.CV · October 24, 2023 · 5 cites

Open Access

TL;DR

This paper introduces a novel medical vision-language pre-training framework that enhances cross-modal alignment and interaction through multi-task paired masking, leading to improved performance across various downstream medical imaging tasks.

Contribution

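The paper proposes Multi-task Paired Masking with Alignment (MPMA), a unified Med-VLP framework that integrates a cross-modal alignment task into joint image-text reconstruction, together with a Global and Local Alignment (GLA) module and further modules aimed at better cross-modal interaction and semantic representation in medical imaging.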

Findings

01. Outperforms previous methods in all downstream tasks

02. Enhances cross-modal interaction and semantic understanding

03. Improves performance in uni-, cross-, and multi-modal tasks

Abstract

In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. As a solution, Medical Vision-Language Pre-training (Med-VLP) methods have been proposed to learn universal representations from medical images and reports, benefiting downstream tasks without requiring fine-grained annotations. However, existing methods have overlooked the importance of cross-modal alignment in joint image-text reconstruction, resulting in insufficient cross-modal interaction. To address this limitation, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction, while a Global and Local Alignment (GLA) module is designed to assist self-supervised paradigm in obtaining…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning