Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
Aditya Shourya, Michel Dumontier, Chang Sun

TL;DR
This paper demonstrates that a small, fine-tuned vision-language model can effectively perform radiological visual question answering, offering a cost-effective alternative to larger models while providing diagnostic tools for performance inspection.
Contribution
The study introduces a lightweight 3B-parameter vision-language model fine-tuned for radiological VQA, together with a novel training pipeline and diagnostic tools for model evaluation.
Findings
Small models can achieve robust radiological VQA performance.
Curated data and synthetic question-answer generation are effective for training.
Diagnostic saliency tools help identify model failure modes.
Abstract
Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, challenges remain at each stage of model development: the scarcity of expert-labeled images hinders data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B-parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline from synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological…
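Because the abstract reports performance separately for open- and closed-ended questions, it can help to see how such a split is commonly scored. The sketch below is an assumption, not the authors' evaluation code: it scores closed-ended answers (e.g. yes/no) by normalized exact match and open-ended answers by token-level recall against the reference answer, a convention often used on radiology VQA benchmarks. The sample format (`type`, `pred`, `gold`) and all function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (punctuation dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def closed_accuracy(preds: list[str], golds: list[str]) -> float:
    """Exact-match accuracy after normalization, for closed-ended questions."""
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)


def token_recall(pred: str, gold: str) -> float:
    """Fraction of reference-answer tokens (with multiplicity) found in the prediction."""
    gold_tokens = Counter(normalize(gold))
    if not gold_tokens:
        return 0.0
    overlap = sum((gold_tokens & Counter(normalize(pred))).values())
    return overlap / sum(gold_tokens.values())


def evaluate(samples: list[dict]) -> dict:
    """Score a hypothetical list of {'type': 'closed'|'open', 'pred': str, 'gold': str}."""
    closed = [s for s in samples if s["type"] == "closed"]
    open_ended = [s for s in samples if s["type"] == "open"]
    open_recall = (
        sum(token_recall(s["pred"], s["gold"]) for s in open_ended) / len(open_ended)
        if open_ended
        else 0.0
    )
    return {
        "closed_accuracy": closed_accuracy(
            [s["pred"] for s in closed], [s["gold"] for s in closed]
        ),
        "open_recall": open_recall,
    }


samples = [
    {"type": "closed", "pred": "Yes", "gold": "yes"},
    {"type": "closed", "pred": "no", "gold": "yes"},
    {"type": "open", "pred": "left lower lobe", "gold": "left lung"},
    {"type": "open", "pred": "Chest X-ray", "gold": "chest x-ray"},
]
print(evaluate(samples))  # {'closed_accuracy': 0.5, 'open_recall': 0.75}
```

Reporting the two question types separately keeps a strong yes/no accuracy from masking weak free-text answers, which is one way to surface the ill-conditioned cases the abstract mentions.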
