Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Jong Hak Moon; Hyungyung Lee; Woncheol Shin; Young-Hak Kim; Edward Choi

arXiv:2105.11333 · cs.CV · September 22, 2022 · 1 citation

TL;DR

MedViLL is a BERT-based multi-modal model that integrates radiology images and reports for both vision-language understanding and generation, outperforming baseline models on four downstream tasks across three datasets.

Contribution

The paper introduces MedViLL, a BERT-based architecture with a new multi-modal attention masking scheme tailored for medical image and report understanding and generation.
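To make the idea of a multi-modal attention mask concrete, here is a minimal sketch of one plausible scheme. It is an illustration only: this summary does not specify MedViLL's actual masking rules, so the rule set below (image tokens attend only to image tokens; text tokens attend to all image tokens and to text tokens up to their own position) and the function name `build_multimodal_mask` are assumptions, not the authors' confirmed method.

```python
def build_multimodal_mask(num_image_tokens, num_text_tokens):
    """Return an n x n boolean mask where mask[q][k] is True if
    query position q may attend to key position k.

    Assumed layout (hypothetical): image tokens come first,
    followed by text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_image_tokens:
                # Every position (image or text) may see all image tokens.
                mask[q][k] = True
            elif q >= num_image_tokens:
                # Text query, text key: causal, so only keys at or before q.
                mask[q][k] = k <= q
            # Image query, text key: left False (image does not see text).
    return mask


if __name__ == "__main__":
    # Two image tokens followed by three text tokens.
    for row in build_multimodal_mask(2, 3):
        print("".join("1" if allowed else "0" for allowed in row))
```

A mask of this shape lets a single Transformer encode the image bidirectionally while decoding the report left to right, which is one way a BERT-style model can serve both understanding and generation tasks.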

Findings

1. Outperforms baseline models on four downstream tasks
2. Demonstrates strong generalization across three datasets
3. Effective for both understanding and report generation tasks

Abstract

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and the unstructured report. We propose Medical Vision Language Learner (MedViLL), which adopts a BERT-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and a vision-language generation task (radiology report generation). By statistically and rigorously evaluating the proposed model on four downstream tasks with…

Code & Models

Repositories

SuperSupermoon/MedViLL (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Transformer · Linear Warmup With Linear Decay · WordPiece · Layer Normalization · Attention Dropout