Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks

Gustav Müller-Franzes; Debora Jutz; Jakob Nikolas Kather; Christiane Kuhl; Sven Nebelung; Daniel Truhn

arXiv:2508.01016 · eess.IV · August 5, 2025

Open Access

TL;DR

This study evaluates five open-source vision-language models on diverse medical imaging tasks. Diagnostic accuracy was promising in some domains, such as chest radiography, but limited in harder ones such as retinal fundoscopy, indicating that further refinement is needed.

Contribution

The paper benchmarks open-source VLMs on multiple medical imaging tasks, comparing their accuracy and analyzing the effect of input type (visual only versus multimodal) and chain-of-thought reasoning.

Findings

01

Qwen2.5 achieved the highest accuracy on chest radiographs (90.4%).

02

Models struggled with retinal fundoscopy, with accuracies around 18.6%.

03

Multimodal input and chain-of-thought reasoning did not consistently improve accuracy.

Abstract

This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon…
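The abstract mentions bootstrapped confidence intervals for accuracy. A minimal sketch of one common variant, the percentile bootstrap, is shown below. It is an illustration under assumptions, not the authors' procedure: the function name `bootstrap_accuracy_ci`, the number of resamples, and the toy correctness flags are all hypothetical, and resampling is done per image, whereas a dataset with several images per patient might instead call for resampling per patient.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 flags, one per image (1 = prediction matched the label).
    Returns (point_estimate, lower_bound, upper_bound).
    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(correct)
    resampled_accuracies = []
    for _ in range(n_resamples):
        # Draw n flags with replacement and record the accuracy of that resample.
        sample = rng.choices(correct, k=n)
        resampled_accuracies.append(sum(sample) / n)
    resampled_accuracies.sort()
    # Read off the alpha/2 and 1 - alpha/2 percentiles of the resampled accuracies.
    lower = resampled_accuracies[int((alpha / 2) * n_resamples)]
    upper = resampled_accuracies[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Toy data (not from the study): 90 correct predictions out of 100 images.
flags = [1] * 90 + [0] * 10
accuracy, lower, upper = bootstrap_accuracy_ci(flags)
print(f"accuracy={accuracy:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

To compare two models on the same images, the same resampling loop could be applied to the per-image difference in correctness, checking whether the resulting interval excludes zero.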

Taxonomy

Topics: COVID-19 diagnosis using AI · AI in cancer detection · Retinal Imaging and Analysis