The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

Dingyu Wang; Zimu Yuan; Jiajun Liu; Shanggui Liu; Nan Zhou; Tianxing Xu; Di Huang; Dong Jiang

arXiv:2512.22275·cs.CV·December 30, 2025

TL;DR

This paper introduces the B&J Benchmark to evaluate vision-language models' clinical reasoning, revealing significant performance gaps in multimodal tasks crucial for real-world patient care.

Contribution

The study presents a comprehensive, real-world clinical benchmark and evaluates multiple models, exposing limitations in current AI's multimodal reasoning abilities in clinical contexts.

Findings

01. High accuracy (>90%) on structured questions

02. Performance drops to ~60% on open-ended multimodal tasks

03. Models often hallucinate and ignore visual evidence

Abstract

Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care.

Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs),…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications · Clinical Reasoning and Diagnostic Skills