Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training
Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, Tsung-Hui Chang

TL;DR
This paper introduces M$^3$AE, a self-supervised multi-modal masked autoencoder framework for medical vision-and-language pre-training. It uses modality-specific masking and reconstruction strategies and achieves state-of-the-art results on a new benchmark.
Contribution
Proposes a multi-modal masked autoencoder approach for medical vision-and-language pre-training, with masking ratios tailored to each modality and separate decoders for vision and language.
Findings
Achieves state-of-the-art results on medical vision-and-language tasks.
Demonstrates effectiveness of different masking ratios for images and texts.
Validates the importance of reconstructing from visual and textual features taken from different layers.
Abstract
Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (MAE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs to make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction to deal with different levels of…
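The first design choice in the abstract, masking images far more aggressively than text, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 196-patch image, and the ratios of 0.75 for images and 0.15 for text are illustrative assumptions, since the text above only states that the image ratio is considerably larger.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Pick which positions to hide; returns sorted indices."""
    num_masked = int(round(num_items * mask_ratio))
    return sorted(rng.sample(range(num_items), num_masked))


def mask_pair(num_patches, tokens, image_ratio=0.75, text_ratio=0.15,
              seed=0, mask_token="[MASK]"):
    """Mask an image (as patch indices) and a text (as tokens) independently.

    image_ratio and text_ratio are assumed, illustrative defaults.
    """
    rng = random.Random(seed)

    # Image side: most patches are dropped; only the visible ones
    # would be passed to the encoder.
    masked_patches = set(random_mask(num_patches, image_ratio, rng))
    visible_patches = [i for i in range(num_patches) if i not in masked_patches]

    # Text side: a small fraction of tokens is replaced by a mask token.
    masked_positions = random_mask(len(tokens), text_ratio, rng)
    hidden = set(masked_positions)
    masked_tokens = [mask_token if i in hidden else tok
                     for i, tok in enumerate(tokens)]

    return visible_patches, sorted(masked_patches), masked_tokens, masked_positions


# Hypothetical example: a 14 x 14 patch grid and a 20-token report sentence.
example_tokens = [f"tok{i}" for i in range(20)]
visible, masked, text_in, text_targets = mask_pair(196, example_tokens)
print(len(visible), len(masked), text_in.count("[MASK]"))
```

The reconstruction targets would then be the pixels of the masked patches and the original tokens at the masked positions. How the decoders consume these targets is not covered by this sketch.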
Taxonomy
Topics: Multimodal Machine Learning Applications · Radiomics and Machine Learning in Medical Imaging · Cancer-related molecular mechanisms research
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Dropout
