Adapting Lightweight Vision Language Models for Radiological Visual Question Answering

Aditya Shourya; Michel Dumontier; Chang Sun

arXiv:2506.14451 · cs.CV · June 18, 2025

TL;DR

This paper demonstrates that a small, fine-tuned vision-language model can effectively perform radiological visual question answering, offering a cost-effective alternative to larger models while providing diagnostic tools for performance inspection.

Contribution

The study introduces a lightweight 3B-parameter vision-language model fine-tuned for radiological VQA, along with a novel training pipeline and diagnostic tools for model evaluation.

Findings

01

Small models can achieve robust radiological VQA performance.

02

Curated data and synthetic question-answer generation are effective for training.

03

Diagnostic saliency tools help identify model failure modes.
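This page does not describe how the paper's saliency diagnostics are computed. As a rough illustration of the general idea only, the sketch below uses occlusion sensitivity, a standard saliency technique swapped in here by assumption: it blanks out one image patch at a time and records how much a model's score drops. The `score` callable stands in for a hypothetical VQA model's confidence in its answer; the function name, patch size, and baseline value are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_saliency(image: Image, score: Callable[[Image], float],
                       patch: int = 2, baseline: float = 0.0) -> Image:
    """Occlusion-sensitivity map (a stand-in, not the paper's tool).

    Each pixel receives the drop in `score` observed when the patch
    containing it is replaced by `baseline`. Larger values mean the
    hypothetical model relied more on that region.
    """
    h, w = len(image), len(image[0])
    reference = score(image)
    saliency = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Copy the image and blank out one patch with the baseline value.
            occluded = [row[:] for row in image]
            for r in range(top, min(top + patch, h)):
                for c in range(left, min(left + patch, w)):
                    occluded[r][c] = baseline
            drop = reference - score(occluded)
            # Attribute the score drop to every pixel in the occluded patch.
            for r in range(top, min(top + patch, h)):
                for c in range(left, min(left + patch, w)):
                    saliency[r][c] = drop
    return saliency


if __name__ == "__main__":
    # Toy "model" that only looks at the top-left 2x2 region.
    def toy_score(img: Image) -> float:
        return sum(img[r][c] for r in range(2) for c in range(2)) / 4

    img = [[1.0] * 4 for _ in range(4)]
    for row in occlusion_saliency(img, toy_score, patch=2):
        print(row)
```

With the toy scorer, only the top-left patch gets a nonzero value, which is the kind of signal a diagnostic tool would use to flag a model attending to an implausible region.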

Abstract

Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, challenges remain at each stage of model development: the limited availability of expert-labeled images hinders data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B-parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline from synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological…
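The abstract reports performance on both open- and closed-ended questions but does not say here how each is scored. A common convention in radiological VQA, assumed for this sketch rather than taken from the paper, is exact-match accuracy for yes/no (closed) questions and a softer token-overlap F1 for free-text (open) answers. The helper names and the yes/no rule for detecting closed questions are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def normalize(text: str) -> List[str]:
    # Lowercase and keep alphanumeric tokens only.
    return re.findall(r"[a-z0-9]+", text.lower())


def is_closed(gold: str) -> bool:
    # Hypothetical rule: a yes/no ground-truth answer marks a closed question.
    return normalize(gold) in (["yes"], ["no"])


def token_f1(pred: str, gold: str) -> float:
    # Bag-of-tokens F1, a common soft metric for free-text answers.
    p, g = normalize(pred), normalize(gold)
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def _mean(xs: List[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def evaluate(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score (prediction, gold) pairs, split by assumed question type."""
    closed, open_ended = [], []
    for pred, gold in pairs:
        if is_closed(gold):
            closed.append(1.0 if normalize(pred) == normalize(gold) else 0.0)
        else:
            open_ended.append(token_f1(pred, gold))
    return {"closed_accuracy": _mean(closed), "open_token_f1": _mean(open_ended)}


if __name__ == "__main__":
    print(evaluate([("Yes", "yes"), ("no", "yes"),
                    ("left lung mass", "mass in left lung")]))
```

Reporting the two groups separately keeps the easy yes/no questions from masking weaker free-text answers, which matters when claiming robustness on both.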

Code & Models

Repositories

adishourya/medm
PyTorch · Official

Datasets

adishourya/MEDPIX-ShortQA
dataset · 84 downloads

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques