VERT: Reliable LLM Judges for Radiology Report Evaluation

Federica Bologna; Jean-Philippe Corbeil; Matthew Wilkens; Asma Ben Abacha

arXiv:2604.03376 · cs.AI · April 7, 2026

TL;DR

This paper evaluates and improves LLM-based metrics for radiology report assessment across various modalities, showing that fine-tuning and lightweight adaptation significantly enhance correlation with expert judgments.

Contribution

It introduces VERT, a new LLM-based metric that outperforms existing ones, and demonstrates effective lightweight fine-tuning methods for reliable radiology report evaluation.

Findings

01. VERT improves correlation with radiologists by up to 11.7%.

02. Fine-tuning Qwen3 30B yields 25% gains with only 1,300 samples.

03. Fine-tuning reduces inference time by up to 37.2 times.

Abstract

Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better…
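The correlation analysis between expert and LLM-based ratings mentioned in the abstract can be illustrated with a minimal sketch. The text above does not say which correlation coefficient is used, so Kendall's tau-b is an assumption here, and the function name `kendall_tau_b` and the toy scores are hypothetical, not data from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(expert_scores, judge_scores):
    """Kendall's tau-b between two equal-length score lists (tie-aware).

    Assumption: rank correlation is one reasonable way to compare an LLM
    judge against expert ratings; the paper's exact statistic may differ.
    """
    if len(expert_scores) != len(judge_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = tied_expert_only = tied_judge_only = 0
    for (e1, j1), (e2, j2) in combinations(zip(expert_scores, judge_scores), 2):
        de, dj = e1 - e2, j1 - j2
        if de == 0 and dj == 0:
            continue  # tied in both lists: excluded from every count
        if de == 0:
            tied_expert_only += 1
        elif dj == 0:
            tied_judge_only += 1
        elif de * dj > 0:
            concordant += 1
        else:
            discordant += 1

    untied = concordant + discordant
    denom = sqrt((untied + tied_expert_only) * (untied + tied_judge_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical toy ratings for five reports (not from RadEval or RaTE-Eval).
expert = [1, 2, 3, 4, 5]
judge = [1, 3, 2, 4, 5]
tau = kendall_tau_b(expert, judge)
```

In this toy case the judge swaps the order of two adjacent reports, so one of the ten report pairs is discordant and the coefficient falls below 1. A metric whose scores rank reports in the same order as the experts would reach 1, and a fully reversed ranking would reach -1.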
