Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, Edward Choi

TL;DR
MedViLL is a novel BERT-based multi-modal model that effectively integrates radiology images and reports for diverse medical vision-language tasks, achieving superior performance across multiple datasets.
Contribution
The paper introduces MedViLL, a BERT-based architecture with a new multi-modal attention masking scheme tailored for medical image and report understanding and generation.
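To make the idea of a multi-modal attention mask concrete, here is a minimal sketch in Python. It is not the paper's masking scheme, whose details are not given on this page. It assumes a joint sequence of image tokens followed by report tokens and shows two generic patterns: a fully bidirectional mask, and a sequence-to-sequence style mask in which every position may read the image tokens, report tokens read only earlier report tokens, and image tokens never read the report. The function name and parameters are hypothetical.

```python
def build_attention_mask(num_image_tokens, num_text_tokens, causal_text=True):
    """Return a (total x total) 0/1 mask; mask[q][k] == 1 means query q may attend to key k.

    Illustrative only: the layout (image tokens first, then report tokens)
    and both mask patterns are assumptions, not the authors' exact scheme.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            q_is_text = q >= num_image_tokens
            k_is_text = k >= num_image_tokens
            if not causal_text:
                allowed = True      # fully bidirectional: everything sees everything
            elif not k_is_text:
                allowed = True      # every position may read the image tokens
            elif q_is_text:
                allowed = k <= q    # report tokens read only themselves and earlier report tokens
            else:
                allowed = False     # image tokens do not read the report
            mask[q][k] = int(allowed)
    return mask


if __name__ == "__main__":
    # Two image tokens followed by three report tokens.
    for row in build_attention_mask(2, 3):
        print(row)
```

A mask of the second kind is what lets a single BERT-style encoder be used for left-to-right report generation, while the bidirectional mask suits understanding tasks such as classification and retrieval.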
Findings
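Outperforms baseline models on four downstream tasks: diagnosis classification, medical image-report retrieval, medical visual question answering, and radiology report generation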
Demonstrates strong generalization across three datasets
Effective for both understanding and report generation tasks
Abstract
Recently a number of studies demonstrated impressive performance on diverse vision-language multi-modal tasks such as image captioning and visual question answering by extending the BERT architecture with multi-modal pre-training objectives. In this work we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and the unstructured report. We propose Medical Vision Language Learner (MedViLL), which adopts a BERT-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and vision-language generation task (radiology report generation). By statistically and rigorously evaluating the proposed model on four downstream tasks with…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Transformer · Linear Warmup With Linear Decay · WordPiece · Layer Normalization · Attention Dropout
