Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?
Sedigheh Eslami, Gerard de Melo, Christoph Meinel

TL;DR
This paper evaluates how well CLIP works for medical visual question answering (MedVQA), introduces PubMedCLIP, a version of CLIP fine-tuned on PubMed articles, and compares it with the original CLIP and with MAML-based visual encoders on two MedVQA benchmark datasets.
Contribution
The study presents PubMedCLIP, a version of CLIP fine-tuned for the medical domain, and analyzes its performance as a visual encoder for MedVQA against the original CLIP and MAML networks within two MedVQA methods, MEVF and QCR.
Findings
Using PubMedCLIP as the visual encoder improves accuracy by up to 3% over MAML
Visual representation learning with language supervision improves MedVQA performance
The two MedVQA benchmark datasets behave differently depending on the visual encoder used
Abstract
Contrastive Language-Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image-text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. This work evaluates the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). To this end, we present PubMedCLIP, a fine-tuned version of CLIP for the medical domain based on PubMed articles. Our experiments are conducted on two MedVQA benchmark datasets and investigate two MedVQA methods, MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via Conditional Reasoning). For each of these, we assess the merits of visual representation learning using PubMedCLIP, the original CLIP, and state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only…
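For readers unfamiliar with the contrastive objective that CLIP-style pre-training relies on, the following is a minimal sketch of a symmetric image-text contrastive loss. It is not the authors' code: the function names, the fixed temperature of 0.07, and the use of plain Python lists in place of a tensor library are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index,
    computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to describe the same
    image-text pair; every other combination is treated as a negative.
    The temperature is fixed here purely for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: row k should select text k.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n

    # Text-to-image direction: transpose, so row k should select image k.
    logits_t = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(logits_t[k], k) for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: two pairs whose embeddings are either aligned or swapped.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each image embedding is closest to its own caption and large when the pairs are mismatched, which is the signal that pulls matching image-text pairs together during pre-training.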
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Model-Agnostic Meta-Learning · Contrastive Language-Image Pre-training
