# Evaluation of GPT-5 for Esophageal Cancer Staging Using Fluorodeoxyglucose Positron Emission Tomography Maximum-Intensity Projection Images: Comparative Pilot Study

**Authors:** Hiroki Maruyama, Yoshitaka Toyama, Yuya Araki, Kentaro Takanami, Masato Ito, Yumi Nakajima, Kei Takase, Takashi Kamei

PMC · DOI: 10.2196/86630 · JMIR Cancer · 2026-02-23

## TL;DR

This study compares GPT-5 and other large language models with physicians in staging esophageal cancer from PET images. The models do not yet match physician accuracy, but newer versions show promise in specific tasks such as identifying abdominal lymph node metastases and cM staging.

## Contribution

The study evaluates the diagnostic accuracy of six LLMs, including GPT-5, for automated esophageal cancer staging from frontal maximum-intensity projection PET images, comparing them directly with four human evaluators.

## Key findings

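- Physicians were significantly more accurate than LLMs (P<.05) for thoracic LN assessment, abdominal LN assessment, and cN staging.
- GPT-5 showed the highest overall accuracy among the LLMs; for thoracic LN detection it reached 76/120 (63%), whereas the other LLMs reached 72/120 (60%) or lower.
- Newer LLMs were more accurate than older models in identifying abdominal LN metastases and in cM staging, but they were less consistent in cN staging.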

## Abstract

Accurate esophageal cancer staging relies on 18F fluorodeoxyglucose positron emission tomography (18F FDG-PET), but its interpretation is complex and time-intensive. This diagnostic burden is exacerbated by significant workforce shortages in both radiology and surgery, thus necessitating automated support systems. The emergence of advanced large language models (LLMs) has raised expectations for their potential to fulfill this role in complex medical tasks.

We evaluated the diagnostic accuracy of LLMs for staging esophageal cancer using 18F FDG-PET images, with a focus on their ability to assess lymph nodes (LNs; clinical N [cN]) and distant metastases (clinical M [cM]) for automated radiology reporting.

This retrospective study included 120 consecutive adult patients who were diagnosed with esophageal squamous cell carcinoma and underwent 18F FDG-PET/computed tomography at Tohoku University Hospital between January 2019 and December 2021. Patients with prior treatment, nonsquamous cell carcinoma histology, or blood glucose levels ≥200 mg/dL were excluded. Frontal maximum-intensity projection positron emission tomography images were extracted, standardized, and analyzed along with information regarding the tumor location. Six LLMs (GPT-5, GPT-4.5, GPT-4.1, OpenAI-o3, OpenAI-o1, and GPT-4 Turbo) and 4 blinded human evaluators (a nuclear medicine specialist, a gastrointestinal surgeon, and 2 radiology residents) assessed the presence of thoracic and abdominal LN metastases on a region-level basis and determined cN and cM staging on a patient-level basis. The model analyses were performed through the application programming interface in a zero-shot setting. Radiology reports served as the reference standard. Diagnostic agreement and accuracy were evaluated using Cohen κ and the Cochran Q test. Additionally, to account for class imbalance in the dataset, the Matthews correlation coefficient was calculated as a robust metric for binary classification performance. Post hoc McNemar tests were performed with Bonferroni correction, with statistical significance for pairwise comparisons set at P<.0083 (adjusted from P<.05). Statistical analyses were performed using JMP Pro (version 18.0; SAS Institute Inc).
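The agreement statistics named above can be illustrated with a small standard-library sketch. It is not the authors' analysis, which was run in JMP Pro. The function names and toy labels are hypothetical, and the exact binomial form of the McNemar test is an assumption, since the abstract does not say which variant was used. The threshold P<.0083 corresponds to dividing .05 by 6, which is how the Bonferroni helper is called here.

```python
import math


def confusion_counts(reference, predicted):
    """Return (tp, fp, fn, tn) for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    return tp, fp, fn, tn


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def cohen_kappa(tp, fp, fn, tn):
    """Cohen kappa for a binary rating against a reference."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both raters.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases rater A got right and rater B got wrong; c: the reverse.
    Assumption: exact binomial variant, not confirmed by the source.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Toy labels (hypothetical, not study data): 1 = metastasis present.
reference = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
counts = confusion_counts(reference, predicted)
print("MCC:", matthews_cc(*counts))
print("kappa:", cohen_kappa(*counts))
print("adjusted alpha:", bonferroni_alpha(0.05, 6))
```

Unlike raw accuracy, both the Matthews correlation coefficient and κ penalize a rater who labels every case negative in an imbalanced dataset, which is the stated reason for reporting the former.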

Average accuracy ranged from 41/120 (34%) to 94/120 (78%) for LLMs and from 72/120 (60%) to 102/120 (85%) for physicians; physicians were significantly more accurate (P<.05) for thoracic LN assessment, abdominal LN assessment, and cN staging. Interrater reliability was slight to fair for LLMs (κ: –0.07 to 0.25) and fair to substantial for physicians (κ: 0.27 to 0.74). Matthews correlation coefficient scores were consistently higher for physicians (0.28 to 0.75) than for LLMs (–0.07 to 0.32). Among the LLMs, GPT-5 demonstrated the highest overall accuracy. Newer LLMs were more accurate than previous models in identifying abdominal LN metastases and in cM staging, though they showed weaker consistency for cN staging. For example, in thoracic LN detection, GPT-5 achieved 76/120 (63%) accuracy, whereas other LLMs achieved 72/120 (60%) or lower accuracy.
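The reported percentages follow directly from the counts over 120 patients, and the qualitative κ labels use the vocabulary of the widely used Landis and Koch scale. The sketch below recomputes both. The abstract does not name the scale it applied, so the band boundaries are an assumption, and the helper names are hypothetical.

```python
def percent(correct, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


def kappa_band(kappa):
    """Qualitative label for kappa, assuming Landis and Koch boundaries."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Counts quoted in the abstract, each out of 120 patients.
for correct in (41, 94, 72, 102, 76):
    print(f"{correct}/120 -> {percent(correct, 120)}%")

# Kappa range endpoints quoted in the abstract.
for kappa in (-0.07, 0.25, 0.27, 0.74):
    print(kappa, kappa_band(kappa))
```

Under these bands, 0.25 and 0.27 fall in "fair" and 0.74 in "substantial", matching the abstract's wording. The lower LLM bound of -0.07 would be labelled "poor" on this scale, so the abstract's "slight" may reflect a different convention.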

Although current LLMs have not yet reached physician-level accuracy in comprehensive staging, recent models show promise in assisting with specific diagnostic tasks.

## Linked entities

- **Chemicals:** 18F fluorodeoxyglucose (PubChem CID 68614), 18F FDG (PubChem CID 68614)
- **Diseases:** esophageal cancer (MONDO:0007576), esophageal squamous cell carcinoma (MONDO:0005580)

## Full-text entities

- **Genes:** MCC (MCC regulator of Wnt signaling pathway) [NCBI Gene 4163] {aka MCC1}, TENM1 (teneurin transmembrane protein 1) [NCBI Gene 10178] {aka ODZ1, ODZ3, TEN-M1, TEN1, TNM, TNM1}
- **Diseases:** Cancer (MESH:D009369), adenocarcinoma (MESH:D000230), esophageal SCC (MESH:D000077277), AI (MESH:C538142), LLMs (MESH:D007806), hcM (MESH:D000092183), SCC (MESH:D002294), bone metastases (MESH:D009362), hallucination (MESH:D006212), N (MESH:C536108), nodal (MESH:D013611), LN (MESH:D000072717), abdominal LN metastases (MESH:D000007), abdominal LN metastasis (MESH:D008207), cardiac accumulation (MESH:D006331), Stage (MESH:D062706), Esophageal Cancer (MESH:D004938), nonsquamous cell carcinoma (MESH:D002280), oncologic (MESH:D000072716)
- **Chemicals:** AX (MESH:D000658), blood glucose (MESH:D001786), AXN (-), 18F FDG (MESH:D019788), glucose (MESH:D005947)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12972682/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12972682/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12972682/full.md

---
Source: https://tomesphere.com/paper/PMC12972682