# Accuracy and reproducibility of large language model measurements of liver metastases: comparison with radiologist measurements

**Authors:** Haruto Sugawara, Akiyo Takada, Shimpei Kato

PMC · DOI: 10.1007/s11604-025-01884-5 · Japanese Journal of Radiology · 2025-10-04

## TL;DR

This study compares how well large language models and radiologists measure liver metastases on CT images, finding that Gemini 2.5 Pro performs best among the three models tested but still falls short of radiologists.

## Contribution

For the first time, the study evaluates how accurately and reproducibly three LLMs (ChatGPT-o3, Gemini 2.5 Pro, and Claude 4 Opus) measure the diameter of liver metastases, using radiologist measurements as the reference.

## Key findings

- Radiologists showed excellent inter-observer (ICC = 0.95) and intra-observer (ICC = 0.98) agreement.
- Gemini achieved good agreement with radiologists (ICC = 0.81) and good intra-model reproducibility (ICC = 0.78).
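- GPT-o3 showed moderate agreement with radiologists (ICC = 0.52) but poor reproducibility (ICC = 0.25); Claude showed poor agreement (ICC = 0.07) and poor reproducibility (ICC = 0.47).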

## Abstract

The aim of this study was to compare the accuracy and reproducibility of lesion-diameter measurements performed by three state-of-the-art large language models (LLMs) with those obtained by radiologists.

In this retrospective study using a public database, 83 patients with solitary colorectal-cancer liver metastases were identified. From each CT series, a radiologist extracted the single axial slice showing the maximal tumor diameter and converted it to a 512 × 512-pixel PNG image (window level 50 HU, window width 400 HU) with the pixel size encoded in the filename. Three LLMs, ChatGPT-o3 (OpenAI), Gemini 2.5 Pro (Google), and Claude 4 Opus (Anthropic), were each prompted to estimate the longest lesion diameter twice, ≥ 1 week apart. Two board-certified radiologists (12 years’ experience each) independently measured the same single-slice images, and one radiologist repeated the measurements after ≥ 1 week. Agreement was assessed with intraclass correlation coefficients (ICC); 95% confidence intervals were obtained by bootstrap resampling (5,000 iterations).
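The windowing step described above can be illustrated with a minimal sketch. It assumes a plain linear mapping of the stated window (level 50 HU, width 400 HU, i.e. the range −150 to 250 HU) onto 8-bit gray values, which is a common convention but is not confirmed by the abstract. The function names `apply_window` and `diameter_mm` are hypothetical, and because the filename format used to encode pixel size is not given, the pixel size is passed in directly.

```python
def apply_window(hu_values, level=50.0, width=400.0):
    """Map Hounsfield units to 0..255 gray values with a linear window.

    Assumption: values below (level - width/2) clip to 0 and values
    above (level + width/2) clip to 255.
    """
    lo = level - width / 2.0   # -150 HU for the window in the abstract
    hi = level + width / 2.0   # 250 HU for the window in the abstract
    gray = []
    for hu in hu_values:
        clipped = min(max(hu, lo), hi)
        gray.append(int(round((clipped - lo) / (hi - lo) * 255)))
    return gray


def diameter_mm(length_px, pixel_size_mm):
    """Convert a length measured in pixels to millimetres.

    pixel_size_mm is the in-plane pixel spacing; here it is supplied by
    the caller rather than parsed from a filename.
    """
    return length_px * pixel_size_mm


if __name__ == "__main__":
    # Air, lower window edge, upper window edge, dense bone.
    print(apply_window([-1000, -150, 250, 1000]))
    # A 40-pixel lesion at 0.75 mm per pixel.
    print(diameter_mm(40, 0.75))
```

Supplying the pixel size alongside the image matters because a PNG carries no physical scale on its own, so a diameter in millimetres cannot be recovered from pixel counts alone.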

Radiologist inter-observer agreement was excellent (ICC = 0.95, 95% CI 0.86–0.99), as was intra-observer agreement (ICC = 0.98, 95% CI 0.94–0.99). Gemini achieved good model-to-radiologist agreement (ICC = 0.81, 95% CI 0.68–0.89) and good intra-model reproducibility (ICC = 0.78, 95% CI 0.65–0.87). GPT-o3 showed moderate agreement (ICC = 0.52) and poor reproducibility (ICC = 0.25); Claude showed poor agreement (ICC = 0.07) and poor reproducibility (ICC = 0.47).
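To make the agreement statistics concrete, the following sketch computes an ICC with a percentile bootstrap confidence interval. The abstract does not say which ICC form was used, so this assumes the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1) in the Shrout and Fleiss notation. The function names and the percentile method for the interval are likewise assumptions, not the authors' implementation.

```python
import random


def icc2_1(ratings):
    """ICC(2,1) for a list of rows, one row per lesion, one column per rater."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between lesions
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bootstrap_ci(ratings, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling lesions (rows) with replacement.

    Degenerate resamples (e.g. every drawn row identical) are not handled
    specially in this sketch.
    """
    rng = random.Random(seed)
    n = len(ratings)
    stats = []
    for _ in range(n_boot):
        sample = [ratings[rng.randrange(n)] for _ in range(n)]
        stats.append(icc2_1(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Synthetic diameters in mm for two raters; not data from the study.
    demo = [[10 + i, 10 + i + (1 if i % 2 else -1)] for i in range(20)]
    print(icc2_1(demo), bootstrap_ci(demo, n_boot=500))
```

Resampling whole rows keeps each lesion's paired measurements together, so the interval reflects variability across lesions rather than across individual readings.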

LLMs do not yet match radiologists in measuring colorectal cancer liver metastases; however, Gemini’s good agreement and reproducibility highlight the rapid progress in the image-interpretation capabilities of LLMs.

## Linked entities

- **Diseases:** colorectal cancer (MONDO:0005575)

## Full-text entities

- **Diseases:** tumor (MESH:D009369), liver metastases (MESH:D009362), colorectal cancer liver metastasis (MESH:D015179)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12860874/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12860874/full.md

---
Source: https://tomesphere.com/paper/PMC12860874