CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee; Geon Choi; Jung-Oh Lee; Hangyul Yoon; Hyuk Gi Hong; Edward Choi

arXiv:2505.18087 · cs.CV · October 28, 2025

TL;DR

CXReasonBench is a new benchmark designed to evaluate the ability of vision-language models to perform structured, clinically meaningful reasoning on chest X-ray images, moving beyond simple diagnostic answers.

Contribution

The paper introduces CheXStruct, a pipeline for deriving intermediate reasoning steps from chest X-rays, and CXReasonBench, a benchmark for assessing models' structured diagnostic reasoning capabilities.

Findings

01 Most evaluated LVLMs struggle with structured reasoning.

02 Models often fail to connect abstract knowledge with visual interpretation.

03 The benchmark enables fine-grained, transparent assessment of diagnostic reasoning.

Abstract

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from…
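The step sequence named in the abstract (landmarks, then measurements, then a diagnostic index, then a clinical threshold) can be illustrated with a minimal sketch. The abstract does not say which indices CheXStruct computes, so the cardiothoracic ratio (CTR) and the 0.5 cutoff commonly cited for PA films are used here purely as an assumed example. All function and field names are hypothetical and are not taken from the CheXStruct code.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    """Widths in pixels, derived from landmark x-coordinates (hypothetical type)."""
    cardiac_width: float
    thoracic_width: float


def measure(heart_left_x: float, heart_right_x: float,
            thorax_left_x: float, thorax_right_x: float) -> Measurements:
    # Step 1: turn anatomical landmarks into diagnostic measurements.
    # The landmarks are assumed to come from an upstream segmentation step.
    return Measurements(
        cardiac_width=abs(heart_right_x - heart_left_x),
        thoracic_width=abs(thorax_right_x - thorax_left_x),
    )


def cardiothoracic_ratio(m: Measurements) -> float:
    # Step 2: compute the diagnostic index from the measurements.
    if m.thoracic_width <= 0:
        raise ValueError("thoracic width must be positive")
    return m.cardiac_width / m.thoracic_width


def apply_threshold(ctr: float, threshold: float = 0.5) -> str:
    # Step 3: apply a clinical threshold. The 0.5 default is an assumed,
    # illustrative cutoff, not a value stated in the abstract.
    return "enlarged" if ctr > threshold else "normal"


if __name__ == "__main__":
    m = measure(heart_left_x=300, heart_right_x=560,
                thorax_left_x=100, thorax_right_x=600)
    ctr = cardiothoracic_ratio(m)
    print(f"CTR = {ctr:.2f}, finding = {apply_threshold(ctr)}")
```

Keeping each step as a separate function mirrors the idea of checking intermediate reasoning steps individually rather than only the final label.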

Code & Models

Repositories

ttumyche/cxreasonbench (official)

Datasets

ttumyche/CheXStruct (dataset, 25 downloads)

Videos

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays (SlidesLive)
