CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction

Jing Zou; Qingqiu Li; Chenyu Lian; Lihao Liu; Xiaohan Yan; Shujun Wang; Jing Qin

arXiv:2505.12057 · cs.AI · May 20, 2025

Open Access

TL;DR

This paper introduces CorBenchX, a large-scale dataset and benchmark for error detection and correction in chest X-ray reports, and proposes a reinforcement learning framework to improve model performance in clinical report correction.

Contribution

The paper provides the first large-scale dataset for chest X-ray report error correction, benchmarks multiple vision-language models on it, and proposes a reinforcement learning method to improve correction accuracy.

Findings

01. o4-mini achieves 50.6% detection accuracy
02. MSRL improves detection precision by 38.3%
03. MSRL enhances correction scores by 5.2%

Abstract

AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and subsequent correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models (e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best…
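The pairing described in the abstract (corrupted report, original text, error type, human-readable description) can be pictured as a simple record. The sketch below is illustrative only: the class name, field names, error-type labels, and the exact-match definition of detection accuracy are all assumptions, not the dataset's actual schema or the paper's metric.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ErrorReportRecord:
    """Hypothetical record layout; field names are assumptions."""
    original_report: str    # report text before error injection
    corrupted_report: str   # report text with the injected error
    error_type: str         # label of the injected error (labels here are made up)
    error_description: str  # human-readable description of the error


def detection_accuracy(records: Sequence[ErrorReportRecord],
                       predicted_types: Sequence[str]) -> float:
    """Fraction of records whose predicted error type matches the label.

    Assumed definition for illustration; the paper may score detection differently.
    """
    if len(records) != len(predicted_types):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    hits = sum(r.error_type == p for r, p in zip(records, predicted_types))
    return hits / len(records)
```

A model's zero-shot predictions would be passed as `predicted_types`, one label per record, in the same order as `records`.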

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications · Topic Modeling