# Can GPT-5.0 Interpret Thyroid Ultrasound Images? A Comparative TI-RADS Analysis with an Expert Radiologist

**Authors:** Yunus Yasar, Sevde Nur Emir, Muhammet Rasit Er, Mustafa Demir

PMC · DOI: 10.3390/diagnostics16020313 · 2026-01-19

## TL;DR

This study compares GPT-5.0's TI-RADS interpretation of thyroid ultrasound images with that of an expert radiologist, using FNA cytology as the reference standard. GPT-5.0 recognizes several high-level features but struggles with echogenic foci and overestimates malignancy risk.

## Contribution

The study evaluates GPT-5.0's interpretation of thyroid ultrasound images against ACR TI-RADS criteria in 100 patients with cytology-confirmed diagnoses, and benchmarks it against a board-certified radiologist.

## Key findings

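- GPT-5.0 showed substantial agreement with the radiologist for composition (κ = 0.62), shape (κ = 0.70), and margin (κ = 0.68), moderate agreement for echogenicity (κ = 0.48), and poor agreement for echogenic foci (κ = 0.12).
- GPT-5.0 had lower sensitivity (67% vs. 89%) and specificity (49% vs. 85%) than the radiologist, driven largely by false-positive TR4 classifications among benign nodules.
- The model systematically up-classifies nodules into TR4–TR5, overestimating malignancy risk; the authors suggest that ultrasound-specific training and domain adaptation may be needed.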
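The sensitivity and specificity figures above depend on dichotomizing TI-RADS categories, with TR4–TR5 counted as positive for malignancy. The following is a minimal sketch of that calculation, not the authors' code; the function names and the toy data are hypothetical and do not reproduce the study's cases.

```python
def is_positive(tirads_category: int) -> bool:
    """Treat TR4-TR5 as test-positive, the threshold stated in the abstract."""
    return tirads_category >= 4


def sensitivity_specificity(categories, malignant):
    """Return (sensitivity, specificity) for TI-RADS calls against cytology.

    categories: TI-RADS categories (1-5), one per nodule.
    malignant:  booleans from the reference standard (True = malignant).
    """
    tp = fn = tn = fp = 0
    for category, truth in zip(categories, malignant):
        positive = is_positive(category)
        if truth and positive:
            tp += 1
        elif truth and not positive:
            fn += 1
        elif not truth and positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one malignant and one benign case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical, NOT from the study): 4 malignant, 4 benign nodules.
toy_categories = [5, 4, 3, 4, 2, 4, 3, 5]
toy_malignant = [True, True, True, False, False, False, False, True]

sens, spec = sensitivity_specificity(toy_categories, toy_malignant)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy set, two of the four benign nodules are called TR4, which halves specificity. This is the same mechanism the key findings describe for GPT-5.0's false positives.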

## Abstract

Background/Objectives: Multimodal large language models (LLMs) may directly interpret medical images, including thyroid ultrasounds (USs). Whether these models can reliably assess thyroid nodules, where subtle echogenic and morphological details are critical, remains uncertain. The American College of Radiology (ACR) TI-RADS system provides a structured framework for benchmarking artificial intelligence. This study evaluates GPT-5.0’s ability to interpret thyroid US images according to TI-RADS criteria and contextualizes its performance relative to expert radiologist assessment, using FNA cytology as the reference standard.

Methods: This retrospective study included 100 patients (mean age 49.8 ± 12.6 years; 72 women) with cytology-confirmed diagnoses: Bethesda II (benign) or Bethesda V–VI (malignant). Each nodule had longitudinal and transverse US images acquired with high-frequency linear probes. A board-certified radiologist (>10 years’ experience) and GPT-5.0 independently assessed TI-RADS features (composition, echogenicity, shape, margin, echogenic foci) and assigned final categories. Agreement was analyzed using Cohen’s κ, and diagnostic performance was calculated using TR4–TR5 as positive for malignancy.

Results: Agreement was substantial for composition (κ = 0.62), shape (κ = 0.70), and margin (κ = 0.68); moderate for echogenicity (κ = 0.48); and poor for echogenic foci (κ = 0.12). GPT-5.0 demonstrated a systematic, risk-averse tendency to up-classify nodules, leading to increased TR4–TR5 assignments. Overall, the TI-RADS agreement was 58% (κ = 0.31). The radiologist showed superior diagnostic performance (sensitivity 89%, specificity 85%) compared with GPT-5.0 (sensitivity 67%, specificity 49%), largely driven by false-positive TR4 classifications among benign nodules.

Conclusions: GPT-5.0 recognizes several high-level TI-RADS features but struggles with microcalcifications and tends to overestimate malignancy risk within a risk-stratification framework, limiting its standalone clinical use. Ultrasound-specific training and domain adaptation may enable meaningful adjunctive roles in thyroid nodule assessment.
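The abstract reports inter-rater agreement as Cohen's κ, which corrects the raw agreement rate for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows the standard unweighted form. Whether the authors used a weighted variant for the ordinal TI-RADS categories is not stated, so that choice is an assumption here, and the labels are toy values rather than study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy composition labels (hypothetical, NOT from the study).
radiologist = ["solid", "solid", "cystic", "mixed", "solid", "cystic"]
model = ["solid", "mixed", "cystic", "mixed", "solid", "solid"]

kappa = cohens_kappa(radiologist, model)
print(round(kappa, 3))  # 0.478
```

Here the raters agree on 4 of 6 items (67%), yet κ is only about 0.48 once chance agreement is removed. The same correction explains how the study's 58% overall TI-RADS agreement corresponds to κ = 0.31.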

## Linked entities

- **Diseases:** thyroid cancer (MONDO:0002108)

## Full-text entities

- **Diseases:** thyroid nodule (MESH:D016606), malignancy (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12840398/full.md

---
Source: https://tomesphere.com/paper/PMC12840398