DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

Tianhong Zhou; Yin Xu; Yingtao Zhu; Chuxi Xiao; Haiyang Bian; Lei Wei; Xuegong Zhang

arXiv:2505.24173 · cs.CV · June 2, 2025

TL;DR

DrVD-Bench is a comprehensive multimodal benchmark designed to evaluate whether vision-language models genuinely reason like human clinicians in medical image diagnosis, highlighting current models' limitations in complex reasoning tasks.

Contribution

This work introduces DrVD-Bench, the first structured benchmark for clinical visual reasoning in medical imaging, covering 20 task types, five imaging modalities, and 17 diagnostic categories.

Findings

01. Performance drops with increased reasoning complexity.

02. Models often rely on superficial correlations rather than true understanding.

03. Some models show traces of human-like reasoning but lack grounded visual comprehension.

Abstract

Vision-language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image-question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow from modality recognition to lesion identification and diagnosis. We benchmark 19…
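
To make the staged workflow concrete, the following minimal sketch scores multiple-choice predictions separately for each stage, which is one way a drop in accuracy along the reasoning chain could be observed. It is not the benchmark's official evaluation code: the stage labels, the record fields (`stage`, `answer`, `prediction`), and the toy records are all assumptions made for illustration, and only the three stages named in the abstract are included.

```python
from collections import defaultdict

# Hypothetical stage labels, taken from the workflow named in the abstract.
# The real benchmark may define more stages or name them differently.
STAGES = ["modality_recognition", "lesion_identification", "diagnosis"]


def accuracy_by_stage(records):
    """Return {stage: accuracy} for multiple-choice records.

    Each record is a dict with the hypothetical keys 'stage', 'answer',
    and 'prediction', where the last two are option letters such as "A".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        stage = record["stage"]
        total[stage] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[stage] += 1
    # Report stages in workflow order, skipping stages with no records.
    return {s: correct[s] / total[s] for s in STAGES if total[s]}


# Toy records, invented for this example only.
toy_records = [
    {"stage": "modality_recognition", "answer": "A", "prediction": "A"},
    {"stage": "modality_recognition", "answer": "C", "prediction": "c"},
    {"stage": "lesion_identification", "answer": "B", "prediction": "B"},
    {"stage": "lesion_identification", "answer": "D", "prediction": "A"},
    {"stage": "diagnosis", "answer": "A", "prediction": "B"},
]

scores = accuracy_by_stage(toy_records)
for stage, acc in scores.items():
    print(f"{stage}: {acc:.2f}")
```

Grouping by stage rather than reporting one overall accuracy keeps early, easier steps from masking failures at the diagnosis step.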

Code & Models

Repositories

jerry-boss/drvd-bench (Official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies