# Conceptual proposal for LLM-generated FDG PET/CT follow-up reports in melanoma: a pilot study on model stability and blinded expert evaluation

**Authors:** Wolfram A. Bosbach, Marie S. Heide, Nasir Gözlügöl, Dana Fatemeh, Foroud Aghapour Zangeneh, David Ventura, Philipp Schindler, Wolfgang Roll, Franziska Strunz, Federico Caobelli, Kuangyu Shi, Ali Afshar-Oromieh, Axel Rominger, Robert Seifert

PMC · DOI: 10.3389/fnume.2026.1723650 · 2026-03-12

## TL;DR

This pilot study explores whether large language models can generate follow-up reports for melanoma FDG PET/CT from text-only prompts; blinded experts rated the LLM reports as comparable or superior to reports written by nuclear medicine residents.

## Contribution

The study pairs blinded expert assessment of LLM-generated PET/CT follow-up reports (quality rating and authorship identification) with a test–retest stability analysis.

## Key findings

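- Reports on the same case showed high intra-case coherence regardless of authorship (mean cosine similarity 0.599–0.727), and LLM-generated reports received ratings comparable or superior to human-authored reports.
- External readers consistently rated reports higher than internal readers did, and significantly preferred the LLM impressions (Fisher's exact test, p = 0.005).
- Human performance declined with case complexity, while Claude in particular improved.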

## Abstract

Oncological patients regularly undergo PET/CT re-staging, which requires a report that outlines their current disease status and highlights relevant changes compared to the previous PET/CT. Large language models (LLMs) may be helpful with documentation in the future. This study is a pilot on LLM performance, focusing on test–retest stability and reproducibility.

Three textbook melanoma follow-up cases of increasing complexity (involving one to eight organs) were selected. From standardized text-only prompts (no imaging data), follow-up reports were written by GPT-4o, Claude Sonnet 4 (each producing three independent revisions), and three nuclear medicine residents. This yielded nine reports per case (27 in total). Six blinded nuclear medicine experts (three internal, three external) performed test–retest evaluations of report quality and authorship identification.
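The report count above can be checked with a short enumeration. This is only an illustrative sketch: the case labels and the `enumerate_reports` helper are hypothetical, and it assumes each of the three residents wrote one report per case, which is what nine reports per case implies.

```python
# Hypothetical labels for the three cases of increasing complexity.
CASES = ["case_1", "case_2", "case_3"]

# Reports per case by author group: three independent revisions per LLM,
# and (assumed) one report from each of the three residents.
REPORTS_PER_CASE = {"GPT-4o": 3, "Claude Sonnet 4": 3, "residents": 3}


def enumerate_reports():
    """Return one (case, author_group, index) tuple per report."""
    return [
        (case, group, k)
        for case in CASES
        for group, n in REPORTS_PER_CASE.items()
        for k in range(1, n + 1)
    ]


reports = enumerate_reports()
per_case = len(reports) // len(CASES)
print(per_case, len(reports))
```

With these assumptions the enumeration gives nine reports per case and 27 overall, matching the figures stated in the abstract.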

The cosine similarity analysis revealed high intra-case coherence (mean: 0.599–0.727) regardless of authorship. The external human readers consistently rated reports higher than the internal human readers. The LLM-generated reports received comparable or superior ratings to human reports, with Claude achieving the highest external reader scores (mean 0.926, standard deviation 0.263, on a 0–1 scale). Human performance declined with case complexity, while Claude, in particular, improved. The external readers significantly preferred the LLM impressions (Fisher’s exact test, p = 0.005). Neither the human nor LLM readers reliably identified authorship (balanced accuracy 0.343–0.500).
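The three statistics named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how report text was vectorized, so the bag-of-words representation here is an assumption, and all function names and toy inputs are hypothetical rather than study data.

```python
from collections import Counter
from math import comb, sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using word-count vectors (assumed)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 is chance level for two classes."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Toy inputs only (not data from the study).
truth = ["llm", "llm", "llm", "llm", "human", "human"]
guess_all_llm = ["llm"] * 6
print(balanced_accuracy(truth, guess_all_llm))
```

Balanced accuracy averages recall across classes, so a reader who labels every report as LLM-written scores 0.5 even though six of the nine reports per case are LLM-generated. That is why the reported range of 0.343–0.500 indicates no reliable authorship identification.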

In this pilot, blinded expert evaluation demonstrated that current LLMs can generate melanoma [18F]fluorodeoxyglucose PET/CT reports from text prompts that are of comparable quality to human-authored reports. High test–retest stability was obtained. Larger future studies will be required to confirm these findings.

## Linked entities

- **Chemicals:** [18F]fluorodeoxyglucose (PubChem CID 68614)
- **Diseases:** melanoma (MONDO:0005105)

## Full-text entities

- **Diseases:** melanoma (MESH:D008545)
- **Chemicals:** FDG (MESH:D019788)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13018137/full.md

---
Source: https://tomesphere.com/paper/PMC13018137