CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi

TL;DR
CXReasonBench is a benchmark that evaluates whether vision-language models can perform structured, clinically meaningful reasoning on chest X-ray images, rather than only producing a final diagnostic answer.
Contribution
The paper introduces CheXStruct, a pipeline that derives intermediate reasoning steps from chest X-rays, and CXReasonBench, a benchmark built on that pipeline to assess models' structured diagnostic reasoning.
Findings
Most evaluated LVLMs struggle with structured reasoning.
Models often fail to connect abstract knowledge with visual interpretation.
The benchmark enables fine-grained, transparent assessment of diagnostic reasoning.
Abstract
Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from…
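The sequence of steps the abstract lists (segmented regions, landmarks, measurements, a diagnostic index, a clinical threshold) can be illustrated with a minimal sketch. This is not the CheXStruct implementation. It assumes the cardiothoracic ratio as the example index, which this excerpt does not name, and uses the commonly cited 0.5 cutoff as an illustrative default. The function names and the toy binary masks are hypothetical.

```python
# Illustrative sketch only: a toy version of the step sequence
# "segmentation -> landmarks -> measurement -> index -> threshold".
# The cardiothoracic ratio (CTR) and the 0.5 cutoff are assumptions chosen
# for illustration; all names below are hypothetical.

def horizontal_extent(mask):
    """Return (leftmost, rightmost) column index of non-zero pixels in a 2D mask."""
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not cols:
        raise ValueError("mask contains no foreground pixels")
    return min(cols), max(cols)


def width_from_mask(mask):
    """Measurement step: horizontal distance between the two extreme landmarks."""
    left, right = horizontal_extent(mask)
    return right - left


def cardiothoracic_ratio(heart_mask, thorax_mask):
    """Index step: cardiac width divided by thoracic width."""
    thoracic_width = width_from_mask(thorax_mask)
    if thoracic_width == 0:
        raise ValueError("thoracic width is zero")
    return width_from_mask(heart_mask) / thoracic_width


def exceeds_threshold(index_value, threshold=0.5):
    """Threshold step: flag the index when it is above the assumed cutoff."""
    return index_value > threshold


def make_row_mask(width, start, end):
    """Build a one-row toy mask with foreground in columns start..end inclusive."""
    return [[1 if start <= c <= end else 0 for c in range(width)]]


if __name__ == "__main__":
    heart = make_row_mask(11, 2, 8)    # toy heart segmentation
    thorax = make_row_mask(11, 0, 10)  # toy thorax segmentation
    ctr = cardiothoracic_ratio(heart, thorax)
    print(f"CTR = {ctr:.2f}, above threshold: {exceeds_threshold(ctr)}")
```

Keeping each step as a separate function mirrors the idea of checking a model at every intermediate stage rather than only at the final answer.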
