Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

Roy Jiang; Hyunjae Kim; Zhenyue Qin; Morten Lee; Margaret MacGibeny; Ailish Hanly; Angela Sadlowski; Shanin Chowdhury; Xuguang Ai; Jeffrey Gehlhausen; Qingyu Chen

arXiv:2605.04098·cs.CV·May 7, 2026

TL;DR

This study evaluates the real-world clinical performance of multimodal large language models in dermatology, revealing significant gaps between benchmark results and practical diagnostic accuracy in hospital settings.

Contribution

The paper provides a comprehensive real-world assessment of current MLLMs in dermatology, highlighting their limitations and the effect of clinical context on diagnostic performance.

Findings

01

Diagnostic accuracy was modest on public benchmarks and declined substantially in the real-world cohort.

02

Incorporating clinical context improved diagnostic accuracy.

03

Models showed moderate sensitivity for severity triage, but were unreliable for diagnosis.

Abstract

Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and…
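To make the two reported metrics concrete, here is a minimal sketch of how top-k diagnostic accuracy and triage sensitivity are conventionally computed. It is not the authors' evaluation code: the function names, the exact-match rule after lowercasing, and the toy inputs are all assumptions for illustration, and the paper may score a differential against the reference diagnosis in a different way.

```python
from typing import Sequence


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic matching rule)."""
    return " ".join(label.lower().split())


def top_k_accuracy(
    differentials: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose reference diagnosis appears in the first k
    entries of the model's ranked differential."""
    if len(differentials) != len(truths):
        raise ValueError("differentials and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        normalize(truth) in {normalize(d) for d in ranked[:k]}
        for ranked, truth in zip(differentials, truths)
    )
    return hits / len(truths)


def sensitivity(
    predicted_severe: Sequence[bool],
    actually_severe: Sequence[bool],
) -> float:
    """True positives divided by all truly severe cases."""
    if len(predicted_severe) != len(actually_severe):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(predicted_severe, actually_severe))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    if true_pos + false_neg == 0:
        raise ValueError("no severe cases in the reference labels")
    return true_pos / (true_pos + false_neg)


if __name__ == "__main__":
    # Toy, made-up cases; not data from the paper.
    ranked = [["Psoriasis", "eczema", "tinea"], ["acne", "rosacea"]]
    reference = ["eczema", "melanoma"]
    print(top_k_accuracy(ranked, reference, k=3))
    print(sensitivity([True, False, True, False], [True, True, False, False]))
```

Exact string matching is the weakest part of this sketch: dermatologic diagnoses have many synonyms, so a real evaluation would need some form of label normalization or expert adjudication.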
