RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
Yingshu Li, Yunyi Liu, Lingqiao Liu, Lei Wang, and Luping Zhou

TL;DR
RadReason is an evaluation framework for automatically generated radiology reports that outputs fine-grained sub-scores across six clinically defined error types, each accompanied by a human-readable justification.
Contribution
RadReason introduces a clinically grounded, explainable evaluation method built on Group Relative Policy Optimization, with Sub-score Dynamic Weighting and Majority-Guided Advantage Scaling, moving report assessment beyond coarse overall scores.
Findings
Outperforms prior offline metrics on the ReXVal benchmark
Achieves accuracy on par with GPT-4-based evaluations
Provides human-readable justifications for scores
Abstract
Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt…
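The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the rule used here (weight proportional to 1 - F1, so that error types with lower live F1 receive more weight) is an assumption made purely for illustration, and the error-type names are placeholders. The advantage computation follows the usual group-relative form associated with Group Relative Policy Optimization, where each sampled response's reward is standardized against the other responses for the same prompt. Majority-Guided Advantage Scaling is not sketched, because its description is cut off in the abstract above.

```python
import math


def dynamic_weights(f1_by_type, eps=1e-8):
    """Hypothetical weighting rule: weight each error type by (1 - F1).

    ASSUMPTION: the source only says weights adapt to live F1 statistics;
    the exact formula is not given, so this rule is illustrative.
    """
    raw = {name: 1.0 - f1 for name, f1 in f1_by_type.items()}
    total = sum(raw.values())
    if total <= eps:
        # Every type already has F1 of about 1: fall back to uniform weights.
        n = len(raw)
        return {name: 1.0 / n for name in raw}
    return {name: value / total for name, value in raw.items()}


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder error-type names and F1 values, not taken from the paper.
live_f1 = {"type_a": 0.9, "type_b": 0.5}
weights = dynamic_weights(live_f1)

# Rewards for four sampled evaluations of the same report pair.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Under this assumed rule the harder error type (`type_b`, with the lower F1) receives the larger weight, and the advantages within a group sum to approximately zero, so only relative quality within the group drives the update.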
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Radiology practices and education · Machine Learning in Healthcare
