Towards Trustworthy Dermatology MLLMs: A Benchmark and Multimodal Evaluator for Diagnostic Narratives

Yuhao Shen; Jiahe Qian; Shuping Zhang; Zhangtianyi Chen; Tao Lu; Juexiao Zhou

arXiv:2511.09195·cs.CV·January 13, 2026

Open Access · 3 Reviews

TL;DR

This paper presents DermBench, a curated benchmark, and DermEval, an automatic evaluator, for assessing diagnostic narratives generated by dermatology multimodal LLMs. The goal is reliable, reproducible, and detailed assessment aligned with clinical standards.

Contribution

Introduces DermBench, a benchmark, and DermEval, a reference-free evaluator, for assessing dermatology diagnostic narratives generated by multimodal LLMs.

Findings

01 DermBench scores align closely with expert ratings (reported deviation 0.251).

02 DermEval provides fine-grained, case-specific critique and scoring.

03 The framework enables scalable, trustworthy evaluation of clinical LLMs.

Abstract

Multimodal large language models (LLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique…
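The agreement between an automatic judge and expert raters, reported on this page as a "deviation", can be illustrated with a small sketch. It assumes the deviation is a mean absolute difference between judge and expert scores, averaged over cases and dimensions; the page does not define the metric, so that reading is an assumption. The dimension names are taken from the reviewer comments further down the page, and the function names and toy scores are hypothetical, not the authors' code or data.

```python
from statistics import mean

# Assumption: six rubric dimensions, named after the reviewer's summary
# ("accurate, safe, grounded, comprehensive, coherent and precise").
DIMENSIONS = (
    "accuracy", "safety", "groundedness",
    "comprehensiveness", "coherence", "precision",
)


def mean_abs_deviation(judge_scores, expert_scores):
    """Mean absolute judge-vs-expert difference over all cases and dimensions.

    Each argument is a list of dicts mapping dimension name -> numeric score,
    one dict per case, in the same case order.
    """
    if len(judge_scores) != len(expert_scores):
        raise ValueError("judge and expert lists must cover the same cases")
    diffs = [
        abs(j[d] - e[d])
        for j, e in zip(judge_scores, expert_scores)
        for d in DIMENSIONS
    ]
    if not diffs:
        raise ValueError("no cases to compare")
    return mean(diffs)


def per_dimension_deviation(judge_scores, expert_scores):
    """Same metric, broken down by dimension."""
    if not judge_scores or len(judge_scores) != len(expert_scores):
        raise ValueError("need matching, non-empty score lists")
    return {
        d: mean(abs(j[d] - e[d]) for j, e in zip(judge_scores, expert_scores))
        for d in DIMENSIONS
    }


# Toy scores for two cases (illustrative only, not results from the paper).
toy_judge = [{d: 4 for d in DIMENSIONS}, {d: 3 for d in DIMENSIONS}]
toy_expert = [{d: 5 for d in DIMENSIONS}, {d: 3 for d in DIMENSIONS}]

if __name__ == "__main__":
    print(mean_abs_deviation(toy_judge, toy_expert))
    print(per_dimension_deviation(toy_judge, toy_expert))
```

A per-dimension breakdown is included because an overall average can hide a dimension, such as safety, on which judge and experts disagree more.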

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Prior work focusing on LVLM performance evaluation in dermatology is limited, and this study addresses this gap by providing both a benchmark and an evaluator tailored to this field.

2. The evaluation dimensions cover multiple clinically meaningful aspects beyond simple accuracy, which provides a more holistic understanding of model behavior.

Weaknesses

1. The clarity of Figure 2 is unacceptable. The font size in the figure is significantly smaller than that of the main text, and even at 200% magnification, the text remains unreadable. This seriously affects the readability and professionalism of the manuscript.

2. Since DermEval is designed to function without reference annotations, how is its evaluation reliability ensured? Comparing only against human ratings is insufficient. The paper lacks experiments that demonstrate the evaluator’s tru

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The framework moves beyond simple accuracy to evaluate narratives on six well-defined dimensions that are critical for clinical trustworthiness, such as safety and reasoning coherence.

2. The process for creating "certified references" is rigorous, involving generation followed by a human-in-the-loop process where board-certified dermatologists review and revise narratives until they achieve a perfect score. This ensures a high-quality gold standard.

Weaknesses

1. DermBench uses a human-certified reference, but the final scoring is still performed by an LLM-judge. This introduces a potential layer of abstraction and bias. Although the alignment tests show this works well, the system's ultimate "ground truth" for benchmarking still relies on a model's judgment rather than direct human scoring of the candidate narratives.

2. The certified references are initially drafted by an MLLM before being revised by clinicians. This process might inadvertently anch

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper is well-motivated by the lack of datasets and methods that benchmark trustworthiness in generative LLMs for dermatology. This is a critical problem, given that today LLMs are widely used to help generate medical descriptions/captions for images, and there is no reliable measurement or evaluation method in this domain. This paper is among the first to bring this issue to attention.

Weaknesses

- Before moving on to the method, it is important to standardize the 6-dimensional criteria on the reliability of generated text by LLMs. Specifically, what it means for a narrative to be "accurate, safe, grounded, comprehensive, coherent and precise"? What do you mean by accuracy and precision? What is the difference between accuracy being rated 3 vs. 4? On one hand, it's important to make sure that the dermatologist and the LLM to be evaluated share the same rubric for assessment. If this is n

Taxonomy

Topics: Cutaneous Melanoma Detection and Management · AI in cancer detection · Multimodal Machine Learning Applications