VERT: Reliable LLM Judges for Radiology Report Evaluation
Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens, Asma Ben Abacha

TL;DR
This paper evaluates and improves LLM-based metrics for radiology report assessment across various modalities, showing that fine-tuning and lightweight adaptation significantly enhance correlation with expert judgments.
Contribution
It introduces VERT, a new LLM-based metric that outperforms existing ones, and demonstrates effective lightweight fine-tuning methods for reliable radiology report evaluation.
Findings
VERT improves correlation with radiologists by up to 11.7%.
Fine-tuning Qwen3 30B yields 25% gains with only 1,300 samples.
Fine-tuning speeds up inference by up to 37.2 times.
Abstract
Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better…
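The correlation analysis between expert and LLM-based ratings described in the abstract can be illustrated with a small sketch. The excerpt does not say which correlation statistic the authors use, so Kendall's tau-b is an assumption here, and the function name `kendall_tau_b` and the sample scores are hypothetical placeholders rather than data from RadEval or RaTE-Eval.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation between two equal-length score lists.

    Assumption: tau-b is used only for illustration; the paper's actual
    statistic is not given in the excerpt above.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    # Compare every pair of reports once.
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both ratings: excluded from both terms
        if dx == 0:
            ties_x += 1  # tied only on the first rating
        elif dy == 0:
            ties_y += 1  # tied only on the second rating
        elif dx * dy > 0:
            concordant += 1  # both ratings order the pair the same way
        else:
            discordant += 1  # the ratings disagree on the ordering

    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    # Undefined when one list is constant; return NaN in that case.
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical scores for five generated reports (not real data).
expert_scores = [0, 1, 1, 3, 4]           # e.g. expert ratings per report
judge_scores = [0.1, 0.3, 0.2, 0.7, 0.9]  # e.g. LLM-judge metric outputs

tau = kendall_tau_b(expert_scores, judge_scores)
print(f"Kendall tau-b: {tau:.3f}")
```

A rank-based statistic is a natural fit when expert ratings and metric outputs live on different scales, since only the ordering of reports matters. In practice a library routine such as `scipy.stats.kendalltau` would usually replace the hand-written loop.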
