Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

Ran Zhang; Yucong Lin; Zhaoli Su; Bowen Liu; Danni Ai; Tianyu Fu; Deqiang Xiao; Jingfan Fan; Yuanyuan Wang; Mingwei Gao; Yuwan Hu; Shuya Gao; Jingtao Li; Jian Yang; Hong Song; Hongliang Sun

arXiv:2603.22935 · cs.AI · March 25, 2026

TL;DR

This paper introduces Ran Score, a new evaluation metric for radiology report generation that leverages large language models and clinician input to better recognize abnormalities and handle clinical language nuances.

Contribution

It presents a clinician-guided framework for multi-label finding extraction and introduces Ran Score, a finding-level metric that improves evaluation accuracy over existing benchmarks.

Findings

01. The optimized framework behind Ran Score exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels.

02. The framework generalizes well across different cohorts.

03. Prompt optimization improves agreement with radiologist-derived reference labels.

Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and…
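
To make the idea of a macro-averaged, finding-level score concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not name the statistic that is macro-averaged, so F1 is assumed here, and the function names and finding labels are hypothetical. Each report is represented as the set of findings extracted from it, and every label contributes equally to the average regardless of how often it occurs.

```python
from typing import List, Sequence, Set


def per_label_f1(reference: Sequence[Set[str]],
                 predicted: Sequence[Set[str]],
                 label: str) -> float:
    """F1 for one finding label across paired reports (assumed statistic)."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        in_ref, in_pred = label in ref, label in pred
        if in_ref and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_ref:
            fn += 1
    denom = 2 * tp + fp + fn
    # A label that never appears in either set scores 0.0 by convention here.
    return 2 * tp / denom if denom else 0.0


def macro_f1(reference: Sequence[Set[str]],
             predicted: Sequence[Set[str]],
             labels: List[str]) -> float:
    """Unweighted mean of per-label F1, so rare findings count as much as common ones."""
    return sum(per_label_f1(reference, predicted, l) for l in labels) / len(labels)


# Hypothetical finding labels and toy reports, for illustration only.
labels = ["pneumothorax", "effusion", "cardiomegaly"]
reference = [{"effusion"}, {"effusion", "cardiomegaly"}, {"pneumothorax"}, set()]
predicted = [{"effusion"}, {"effusion"}, set(), {"cardiomegaly"}]

score = macro_f1(reference, predicted, labels)
print(f"macro-averaged F1: {score:.3f}")
```

In this toy case the common finding (effusion) is extracted perfectly, yet the two rarer findings are missed, so the macro average drops to one third. This is why a macro-averaged score is sensitive to the low-prevalence abnormalities the abstract highlights.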

Taxonomy

Topics: COVID-19 diagnosis using AI · Topic Modeling · Artificial Intelligence in Healthcare and Education