Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Zhihong Chen, Guanbin Li, Xiang Wan

TL;DR
This paper introduces a knowledge-enhanced medical vision-and-language pre-training method that aligns, reasons, and learns from structured medical knowledge, significantly improving performance on multiple downstream tasks.
Contribution
The paper systematically incorporates structured medical knowledge into Med-VLP by aligning representations, enabling reasoning, and emphasizing critical information; prior Med-VLP methods rarely exploited such knowledge explicitly.
Findings
Achieves state-of-the-art results on all downstream tasks.
Effectively integrates medical knowledge into vision-and-language models.
Provides a comprehensive benchmark for future research.
Abstract
Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods mainly contain three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, with few studies considering the importance of medical domain expert knowledge and explicitly exploiting such knowledge to facilitate Med-VLP. Although there exist knowledge-enhanced vision-and-language pre-training (VLP) methods in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers), which are unavailable in the medical domain. In this paper, we propose a systematic and effective approach to enhance Med-VLP by structured medical knowledge from three perspectives.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
