CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation
Ruifeng Yuan, Wanxing Chang, Weiwei Cao, Bowen Shi, Zhongyu Wei, Ling Zhang, Jianpeng Zhang

TL;DR
CT-FineBench introduces a QA-based benchmark for evaluating the detailed factual accuracy of CT report generation, addressing limitations of existing coarse metrics.
Contribution
It presents a novel, clinically-relevant evaluation method that assesses fine-grained factual consistency in CT reports using a structured QA approach.
Findings
CT-FineBench correlates better with expert assessments.
It is more sensitive to detailed factual errors than prior metrics.
The benchmark improves evaluation granularity for CT report generation.
Abstract
The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports, constructed from CT-RATE and Merlin. Our benchmark is constructed through a meticulous, Question-Answering (QA) based process: first, we identify and structure key, finding-specific clinical attributes (like location, size, margin). Second, we systematically transform these attributes into a QA dataset, where questions…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
