Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering
Cuong Nhat Ha, Shima Asaadi, Sanjeev Kumar Karn, Oladimeji Farri, Tobias Heimann, Thomas Runkler

TL;DR
This paper introduces a medical vision-language model that adapts large pre-trained vision and language models to medical VQA through three stages of parameter-efficient training on biomedical and radiology datasets, reaching state-of-the-art accuracy on SLAKE 1.0.
Contribution
The paper presents an approach that combines large vision and language models already adapted to the medical domain and trains the combined model for medical VQA, with the aim of improving accuracy in this specialized domain.
Findings
Achieves 87.5% overall accuracy on the SLAKE 1.0 MedVQA dataset, reported as state of the art.
Attains 73.2% overall accuracy on the VQA-RAD dataset.
Demonstrates the effectiveness of multi-stage, domain-specific adaptation.
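To make the reported metric concrete, here is a minimal sketch of how an overall accuracy figure can be computed over a set of VQA answers. It assumes case- and whitespace-insensitive exact matching between predicted and reference answers; the page does not state the paper's matching rule, and the function names and sample answers below are hypothetical.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(answer.lower().split())


def overall_accuracy(predictions, references) -> float:
    """Fraction of questions whose predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical answers for four questions; two of them match.
preds = ["Yes", "left lung", "no", "ct"]
refs = ["yes", "Left  lung", "yes", "MRI"]
score = overall_accuracy(preds, refs)
```

With this definition, 87.5% corresponds to 875 matching answers per 1,000 questions.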
Abstract
Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
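The abstract does not say which parameter-efficient method is used. As an illustration of the general idea only, the sketch below assumes a low-rank adapter, one common choice: the pre-trained weight matrix stays frozen and two small matrices are the only trainable parameters. The class and variable names are hypothetical and the code is not the authors' implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LowRankAdaptedLinear:
    """Frozen weight W plus a trainable low-rank update: y = W x + B (A x)."""

    def __init__(self, weight, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, d_out x d_in
        # A (rank x d_in): small random values, trainable.
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B (d_out x rank): zeros, so the adapter starts as a no-op.
        self.up = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

    def frozen_params(self):
        return len(self.weight) * len(self.weight[0])


# A 64 x 64 identity matrix stands in for a pre-trained weight (assumption).
size = 64
weight = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
layer = LowRankAdaptedLinear(weight, rank=2)
x = [float(i) for i in range(size)]
y = layer.forward(x)
```

In this toy setting the adapter adds 256 trainable parameters next to 4,096 frozen ones. Because B starts at zero, the adapted layer initially reproduces the frozen layer's output exactly.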
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
