RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Yishu Wei; Adam E. Flanders; Errol Colak; John Mongan; Luciano M Prevedello; Po-Hao Chen; Henrique Min Ho Lee; Gilberto Szarf; Hamilton Shoji; Jason Sho; Katherine Andriole; Tessa Cook; Lisa C. Adams; Linda C. Chu; Maggie Chung; Geraldine Brusca-Augello; Djeven P. Deva; Navneet Singh; Felipe Sanchez Tijmes; Jeffrey B. Alpert; Elsie T. Nguyen; Drew A. Torigian; Kate Hanneman; Lauren K Groner; Alexander Phan; Ali Islam; Matias F. Callejas; Gustavo Borges da Silva Teles; Faisal Jamal; Maryam Vazirabad; Ali Tejani; Hari Trivedi; Paulo Kuriki; Rajesh Bhayana; Elana T. Benishay; Yi Lin; Yifan Peng; George Shih

arXiv:2601.15129 · cs.CL · January 22, 2026

Open Access

TL;DR

This paper introduces a high-quality, radiologist-verified benchmark dataset of chest radiographs with AI-assisted labeling, enabling improved evaluation and development of multimodal large language models in radiology.

Contribution

It presents a curated dataset of 200 chest radiographs with expert-verified labels and an AI-assisted labeling process to enhance efficiency and accuracy in radiological annotation.

Findings

01

Created a benchmark of 200 chest radiographic studies, split into released and holdout sets of 100 each.

02

Developed an AI-assisted labeling procedure for radiologists.

03

Achieved high agreement among radiologists on labels.

Abstract

Multimodal large language models have demonstrated performance comparable to that of radiology trainees on multiple-choice board-style exams. However, developing clinically useful multimodal LLM tools requires high-quality benchmarks curated by domain experts. This work aims to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the Medical Imaging and Data Resource Center (MIDRC) were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies…
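The abstract is cut off before it states what the sampling algorithm guarantees, so the following is only a hypothetical illustration of what sampling "on the basis of AI-suggested labels" could look like, not the authors' method. It assumes each study carries a set of suggested labels and that the goal is a minimum number of studies per label; the function name `sample_by_labels`, the `min_per_label` quota, and the toy label names are all invented for this sketch.

```python
import random
from collections import Counter


def sample_by_labels(studies, n_total, min_per_label, seed=0):
    """Pick n_total study ids, trying to reach min_per_label for every label.

    studies: dict mapping study id -> set of AI-suggested labels (hypothetical
    input format). Returns (selected ids, per-label counts in the selection).
    """
    rng = random.Random(seed)
    ids = sorted(studies)
    rng.shuffle(ids)  # random order, reproducible through the seed

    selected, chosen, counts = [], set(), Counter()

    def take(sid):
        chosen.add(sid)
        selected.append(sid)
        counts.update(studies[sid])

    # Pass 1: handle the rarest labels first so they are not crowded out.
    label_freq = Counter(l for labels in studies.values() for l in labels)
    for label in sorted(label_freq, key=lambda l: (label_freq[l], l)):
        for sid in ids:
            if counts[label] >= min_per_label or len(selected) >= n_total:
                break
            if sid not in chosen and label in studies[sid]:
                take(sid)

    # Pass 2: fill the remaining slots with randomly ordered leftover studies.
    for sid in ids:
        if len(selected) >= n_total:
            break
        if sid not in chosen:
            take(sid)

    return selected, counts


# Toy data with placeholder labels (not the 12 benchmark labels).
toy = {
    f"s{i}": {
        name
        for name, step in (("common", 2), ("medium", 5), ("rare", 10))
        if i % step == 0
    }
    for i in range(50)
}
picked, counts = sample_by_labels(toy, n_total=12, min_per_label=3)
print(len(picked), dict(counts))
```

A greedy quota pass like this keeps rare labels represented in a small review set, but it does not control co-occurrence or the share of normal studies; a real pipeline would need whatever constraints the full paper specifies.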

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · COVID-19 diagnosis using AI · Radiology practices and education