The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang

TL;DR
This paper introduces the B&J Benchmark to evaluate vision-language models' clinical reasoning, revealing significant performance gaps in multimodal tasks crucial for real-world patient care.
Contribution
The study presents a comprehensive, real-world clinical benchmark and evaluates multiple models, exposing limitations in current AI's multimodal reasoning abilities in clinical contexts.
Findings
High accuracy (>90%) on structured questions
Performance drops to ~60% on open-ended multimodal tasks
Models often hallucinate and ignore visual evidence
Abstract
Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs),…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications · Clinical Reasoning and Diagnostic Skills
