RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Zheng Yuan; Qiao Jin; Chuanqi Tan; Zhengyun Zhao; Hongyi Yuan; Fei Huang; Songfang Huang

arXiv:2303.00534 · cs.CV · March 2, 2023 · 5 cites

TL;DR

This paper introduces RAMM, a retrieval-augmented pretraining and fine-tuning approach for biomedical visual question answering, leveraging a new dataset PMCPM and retrieval mechanisms to improve performance amid limited data.

Contribution

The paper proposes a novel retrieval-augmented paradigm for biomedical VQA, including a new dataset PMCPM, a retrieval-attention module, and state-of-the-art results on multiple datasets.

Findings

01 The approach achieves state-of-the-art performance on the Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets.

02 The PMCPM dataset enhances biomedical VQA capabilities.

03 The retrieval-augmented approach outperforms previous methods.

Abstract

Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representation for image-text pairs and align these representations with image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a…
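The two ITC-related ideas in the abstract, aligning paired image and text representations with a contrastive objective and then retrieving similar pairs by embedding similarity, can be illustrated with a minimal sketch. This is not the RAMM implementation: the function names (`itc_loss`, `retrieve_top_k`), the symmetric InfoNCE form of the loss, the temperature of 0.07, and the use of cosine similarity for ranking are all assumptions chosen for illustration, and the toy vectors stand in for encoder outputs. The retrieval-augmented module that consumes the retrieved pairs is not sketched because the abstract is truncated before describing it.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k (image_embs[k], text_embs[k]) is the positive; every other
    combination in the batch is a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[k], computed with a stable log-sum-exp.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[k]
        return total / n

    text_to_image = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(text_to_image))


def retrieve_top_k(query_emb, corpus_embs, k=2):
    """Return indices of the k corpus embeddings most similar to the query."""
    order = sorted(
        range(len(corpus_embs)),
        key=lambda j: cosine(query_emb, corpus_embs[j]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings standing in for encoder outputs (hypothetical data).
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = retrieve_top_k([1.0, 0.0], corpus, k=2)

aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when each image embedding is closest to its own text embedding and large when the pairing is swapped, which is the behavior a contrastive alignment objective is meant to reward. A practical system would compute similarities in batch on an accelerator and use an approximate nearest-neighbor index rather than a full sort.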

Code & Models

Repositories

GanjinZero/RAMM
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALIGN