Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
Gustav Müller-Franzes, Debora Jutz, Jakob Nikolas Kather, Christiane Kuhl, Sven Nebelung, Daniel Truhn

TL;DR
This study evaluates five open-source vision-language models on five medical imaging tasks. Accuracy is promising for chest radiography and endoscopy but poor for retinal fundoscopy, indicating that further refinement is needed.
Contribution
The paper benchmarks open-source VLMs on multiple medical imaging tasks, compares their accuracy, and analyzes the effect of input modality and reasoning strategy (visual input only, multimodal input, and chain-of-thought reasoning).
Findings
Qwen2.5 achieved the highest accuracy on chest radiographs (90.4%).
Models struggled with retinal fundoscopy, with accuracies around 18.6%.
Multimodal input and chain-of-thought reasoning did not consistently improve accuracy.
Abstract
This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon…
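The abstract mentions bootstrapped confidence intervals for accuracy comparisons without giving the procedure. The following is a minimal sketch of one common variant, the percentile bootstrap, and is not the authors' code. The function names, the number of resamples, the seed, and the toy labels are all assumptions made for illustration. It also resamples per image, whereas a study with several images per patient might resample per patient instead.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def _percentile_interval(sorted_stats, alpha):
    """Lower and upper percentile bounds of an already sorted list."""
    n = len(sorted_stats)
    lo_idx = max(0, int((alpha / 2) * n))
    hi_idx = min(n - 1, int((1 - alpha / 2) * n) - 1)
    return sorted_stats[lo_idx], sorted_stats[hi_idx]


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy (hypothetical helper).

    Returns (observed_accuracy, ci_lower, ci_upper).
    """
    observed = accuracy(y_true, y_pred)
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    # Resample images with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower, upper = _percentile_interval(stats, alpha)
    return observed, lower, upper


def bootstrap_accuracy_diff_ci(y_true, pred_a, pred_b,
                               n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile-bootstrap CI for accuracy(A) - accuracy(B).

    Both models are scored on the same resampled images, so the pairing
    between their predictions is preserved.
    """
    observed = accuracy(y_true, pred_a) - accuracy(y_true, pred_b)
    rng = random.Random(seed)
    diff = [int(t == a) - int(t == b) for t, a, b in zip(y_true, pred_a, pred_b)]
    n = len(diff)
    stats = sorted(sum(rng.choices(diff, k=n)) / n for _ in range(n_resamples))
    lower, upper = _percentile_interval(stats, alpha)
    return observed, lower, upper


if __name__ == "__main__":
    # Made-up binary labels, for illustration only (not MedFMC data).
    truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    model_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    model_b = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
    print(bootstrap_accuracy_ci(truth, model_a))
    print(bootstrap_accuracy_diff_ci(truth, model_a, model_b))
```

Under this sketch, a difference between two models would be read as significant at the chosen level when the interval returned by `bootstrap_accuracy_diff_ci` excludes zero.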
Taxonomy
Topics: COVID-19 diagnosis using AI · AI in cancer detection · Retinal Imaging and Analysis
