Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training

Zhihong Chen; Yuhao Du; Jinpeng Hu; Yang Liu; Guanbin Li; Xiang Wan; Tsung-Hui Chang

arXiv:2209.07098·cs.CV·September 16, 2022



TL;DR

This paper introduces M$^3$AE, a self-supervised multi-modal masked autoencoder framework for medical vision-and-language pre-training. The model reconstructs missing pixels and tokens from randomly masked images and texts, with a larger masking ratio for images than for texts, and reports state-of-the-art results on a new benchmark.

Contribution

Proposes a multi-modal masked autoencoder approach for medical vision-and-language pre-training, with masking ratios tailored to images and texts and separate decoders.

Findings

01. Achieves state-of-the-art results on medical vision-and-language tasks.

02. Demonstrates effectiveness of different masking ratios for images and texts.

03. Validates the importance of layer-specific feature reconstruction.

Abstract

Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs to make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction to deal with different levels of…
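The first design choice in the abstract, masking images far more aggressively than text, can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `random_mask`, the patch and token counts, and the two ratio values are assumptions chosen for illustration, since the abstract states only that the image ratio is considerably larger.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Split indices 0..num_items-1 into (masked, visible) uniformly at random.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_masked = round(num_items * mask_ratio)
    masked = sorted(rng.sample(range(num_items), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_items) if i not in masked_set]
    return masked, visible


# All values below are illustrative assumptions, not figures from the abstract.
NUM_PATCHES = 196        # e.g. a 14 x 14 grid of image patches
NUM_TOKENS = 32          # e.g. a short report sentence after tokenization
IMAGE_MASK_RATIO = 0.75  # high ratio: image patches are highly redundant
TEXT_MASK_RATIO = 0.15   # low ratio: each token carries more information

rng = random.Random(0)   # fixed seed so the split is reproducible
img_masked, img_visible = random_mask(NUM_PATCHES, IMAGE_MASK_RATIO, rng)
txt_masked, txt_visible = random_mask(NUM_TOKENS, TEXT_MASK_RATIO, rng)

print(len(img_masked), len(img_visible))  # 147 49
print(len(txt_masked), len(txt_visible))  # 5 27
```

In a masked-autoencoder setup, the masked positions serve as reconstruction targets (pixels for image patches, token identities for text). With the assumed ratios, the image side hides three quarters of its patches while the text side hides only a handful of tokens.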

Code & Models

Repositories

zhjohnchan/m3ae (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Radiomics and Machine Learning in Medical Imaging · Cancer-related molecular mechanisms research

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Dropout