Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?

Sedigheh Eslami; Gerard de Melo; Christoph Meinel

arXiv:2112.13906·cs.CV·December 30, 2021·41 cites

Open Access 2 Models

TL;DR

This paper evaluates how well CLIP transfers to medical visual question answering (MedVQA). It introduces PubMedCLIP, a version of CLIP fine-tuned on PubMed articles, and compares it with the original CLIP and with MAML-based visual encoders inside two MedVQA methods, MEVF and QCR.

Contribution

The study presents PubMedCLIP, a fine-tuned version of CLIP for the medical domain, and analyzes its performance as a visual encoder for MedVQA against the original CLIP and MAML networks.

Findings

01 PubMedCLIP outperforms MAML in accuracy by up to 3%

02 Visual language supervision improves MedVQA performance

03 Different datasets exhibit varying behaviors with visual encoders

Abstract

Contrastive Language-Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image-text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. This work evaluates the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). To this end, we present PubMedCLIP, a fine-tuned version of CLIP for the medical domain based on PubMed articles. Our experiments are conducted on two MedVQA benchmark datasets and investigate two MedVQA methods, MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via Conditional Reasoning). For each of these, we assess the merits of visual representation learning using PubMedCLIP, the original CLIP, and state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only…
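For readers unfamiliar with the cross-modal supervision the abstract refers to, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It is an illustration only, not the authors' training code: the function names, the toy two-dimensional embeddings, and the fixed temperature of 0.07 are assumptions made for this example, and the actual fine-tuning setup used for PubMedCLIP is not described in this excerpt.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature is a hypothetical fixed value chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (assumed): two orthogonal "images" and their "captions".
embs = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_contrastive_loss(embs, embs)        # pairs aligned
loss_swapped = clip_contrastive_loss(embs, embs[::-1])  # pairs mismatched
print(loss_matched < loss_swapped)
```

The loss is small when each image embedding is most similar to its own caption and large when the pairing is scrambled, which is the signal that pulls matching image and text representations together during pre-training or fine-tuning.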

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Model-Agnostic Meta-Learning · Contrastive Language-Image Pre-training