CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

Ruifeng Yuan; Wanxing Chang; Weiwei Cao; Bowen Shi; Zhongyu Wei; Ling Zhang; Jianpeng Zhang

arXiv:2604.24001·cs.AI·April 28, 2026

TL;DR

CT-FineBench introduces a QA-based benchmark for evaluating the detailed factual accuracy of CT report generation, addressing limitations of existing coarse metrics.

Contribution

The paper presents a clinically relevant evaluation method that uses structured question answering to assess the fine-grained factual consistency of CT reports.

Findings

01 CT-FineBench correlates better with expert assessments than prior metrics.

02 It is more sensitive to detailed factual errors than prior metrics.

03 The benchmark improves evaluation granularity for CT report generation.

Abstract

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports. Our benchmark is constructed through a meticulous, Question-Answering (QA) based process. First, we identify and structure key, finding-specific clinical attributes (like location, size, margin). Second, we systematically transform these attributes into a QA dataset, where questions…
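As a rough illustration of the second step described in the abstract, the sketch below turns structured finding attributes into question-answer items. It is not the authors' pipeline: the `Finding` class, the `finding_to_qa` function, the question template, and the example nodule are all hypothetical, chosen only to show the general shape of an attribute-to-QA transformation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    """Hypothetical container for one reported finding and its attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


def finding_to_qa(finding: Finding) -> List[Dict[str, str]]:
    """Create one QA item per attribute, using an assumed question template."""
    return [
        {
            "finding": finding.name,
            "attribute": attribute,
            "question": f"What is the {attribute} of the {finding.name}?",
            "answer": value,
        }
        for attribute, value in finding.attributes.items()
    ]


# Made-up example finding; the values are illustrative, not benchmark data.
nodule = Finding(
    name="pulmonary nodule",
    attributes={
        "location": "right upper lobe",
        "size": "8 mm",
        "margin": "spiculated",
    },
)

qa_items = finding_to_qa(nodule)
for item in qa_items:
    print(item["question"], "->", item["answer"])
```

Keeping one item per attribute means an error in a single detail, such as a wrong size, affects exactly one question, which is the kind of granularity the abstract attributes to the benchmark.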

Code & Models

Datasets

csyrf/CT-FineBench (dataset, 116 downloads)