Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Cuong Nhat Ha; Shima Asaadi; Sanjeev Kumar Karn; Oladimeji Farri; Tobias Heimann; Thomas Runkler

arXiv:2404.16192 · cs.CL · April 26, 2024 · 1 citation

TL;DR

This paper introduces a specialized medical vision-language model that adapts large pre-trained models for medical VQA tasks, achieving state-of-the-art accuracy through multi-stage, parameter-efficient training on biomedical datasets.

Contribution

The paper presents an approach for adapting large pre-trained vision and language models to medical VQA, with the aim of improving accuracy in this specialized domain.

Findings

01 Achieves 87.5% overall accuracy on the SLAKE 1.0 MedVQA dataset, reported as state of the art.

02 Attains 73.2% overall accuracy on the VQA-RAD dataset.

03 Demonstrates the effectiveness of multi-stage, domain-specific adaptation.

Abstract

Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
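
The abstract reports overall accuracy but does not describe the scoring procedure. As a rough illustration only, the sketch below assumes overall accuracy means the fraction of questions whose predicted answer matches the reference after simple normalization. The function names `normalize` and `overall_accuracy` and the toy examples are hypothetical and are not taken from the paper, SLAKE, or VQA-RAD.

```python
from typing import Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def overall_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs that match exactly after normalization.

    Assumption: exact match over all questions; the paper's actual metric may differ.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no examples to score")
    correct = sum(normalize(pred) == normalize(ref) for pred, ref in pairs)
    return correct / len(pairs)


# Hypothetical toy examples, not drawn from any real dataset.
examples = [
    ("Yes", "yes"),
    ("left lung", "Left  Lung"),
    ("MRI", "CT"),
    ("no", "no"),
]
score = overall_accuracy(examples)
print(f"overall accuracy: {score:.1%}")  # prints "overall accuracy: 75.0%"
```

Under this assumption, a reported figure such as 87.5% would correspond to 875 matching answers per 1,000 test questions.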

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques