CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

Vanshali Sharma; Andrea Mia Bejar; Gorkem Durak; Ulas Bagci

arXiv:2601.11488 · cs.CL · January 19, 2026

Open Access

TL;DR

This paper introduces CTest-Metric, a comprehensive framework for evaluating the clinical validity of metrics used in CT radiology report generation, addressing the need for standardized assessment in medical AI.

Contribution

It presents the first unified framework with modules for testing metric robustness, generalizability, and correlation with clinician judgments, enhancing reproducibility and benchmarking.

Findings

01

GREEN Score correlates best with expert ratings (Spearman ≈ 0.70)

02

Lexical metrics are highly sensitive to stylistic variations

03

BERTScore-F1 is least affected by factual errors
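
To make finding 02 concrete, here is a minimal sketch of why a purely lexical score can penalize a faithful rephrasing. It assumes a toy unigram-overlap F1 as a stand-in for lexical metrics in general; it is not the BLEU, ROUGE, or METEOR implementation used in the paper, and the function name `unigram_f1` and the three example sentences are invented for illustration.

```python
import re
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Toy lexical metric: F1 over lowercase word overlap (hypothetical helper)."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Multiset intersection counts each shared word at most min(ref, cand) times.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences, not taken from the paper's data.
reference = "No evidence of pulmonary embolism."
rephrased = "Pulmonary embolism is not seen."   # same meaning, different style
wrong = "Evidence of pulmonary embolism."       # opposite meaning, similar words

score_rephrased = unigram_f1(reference, rephrased)
score_wrong = unigram_f1(reference, wrong)
print(f"rephrased: {score_rephrased:.2f}, factually wrong: {score_wrong:.2f}")
```

In this toy case the clinically wrong sentence outscores the correct paraphrase, because word overlap ignores the negation. This is the kind of behavior that style and error tests are meant to expose.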

Abstract

In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, the first unified metric assessment framework, with three modules that determine the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph,…
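
The Metrics-vs-Expert (MvE) module relates metric scores to clinician ratings, and the findings report a Spearman coefficient. As a rough sketch of how such a rank correlation is computed, the standard-library code below ranks both lists (averaging tied ranks) and takes the Pearson correlation of the ranks. The helper names and the five data points are hypothetical; the paper's actual ratings and pipeline are not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: expert ratings for five reports and one metric's scores.
expert_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]
rho = spearman(expert_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks matter, a metric can correlate well with experts even if its raw scores sit on a very different scale from the rating scale.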

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Radiology practices and education