# Interobserver agreement between artificial intelligence models in the thyroid imaging and reporting data system (TIRADS) assessment of thyroid nodules

**Authors:** Andrea Leoncini, Pierpaolo Trimboli

PMC · DOI: 10.1007/s12020-025-04272-1 · Endocrine · 2025-05-15

## TL;DR

This study compares how consistently three AI models (ChatGPT, Google Gemini, and Claude) assess thyroid nodule risk under three TIRADSs (ACR-TIRADS, EU-TIRADS, and K-TIRADS).

## Contribution

The study introduces a novel evaluation of interobserver agreement among AI models in thyroid nodule risk assessment using TIRADS systems.

## Key findings

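- The three AI models showed non-negligible variability in risk assessment, with pairwise κ ranging from 0.53 to 0.90 across the three TIRADSs.
- Gemini and Claude had the highest agreement under ACR-TIRADS (κ = 0.90) but the lowest under K-TIRADS (κ = 0.61), where ChatGPT and Gemini agreed best (κ = 0.88).
- ChatGPT showed clearly lower agreement with the other AIs only under ACR-TIRADS (κ = 0.58 and 0.53); under EU-TIRADS and K-TIRADS, the ChatGPT and Gemini pair had the highest κ.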

## Abstract

As ultrasound (US) is the most accurate tool for assessing the risk of malignancy (RoM) of thyroid nodules (TNs), international societies have published various Thyroid Imaging and Reporting Data Systems (TIRADSs). With the recent advent of artificial intelligence (AI), clinicians and researchers should ask how AI models interpret TIRADS terminology and whether different AIs agree in their risk assessment of TNs. The aim of this study was to analyze the interobserver agreement (IOA) between AIs in assessing the RoM of TNs across the categories of various TIRADSs, using a case series created by combining TIRADS descriptors.

ChatGPT, Google Gemini, and Claude were compared. ACR-TIRADS, EU-TIRADS, and K-TIRADS were used to evaluate the AI assessments. Multiple written scenarios were created for the three TIRADSs, each case was evaluated by the three AIs, and their assessments were analyzed and compared. The IOA was estimated with the kappa (κ) statistic, and κ values were compared between pairs of AIs.
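A pairwise κ of the kind described here can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say which κ variant the authors used (unweighted, weighted, or a multi-rater form), so the sketch uses plain unweighted Cohen's kappa, and the function name `cohen_kappa` and the two rating lists are hypothetical, not data from the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases.

    Assumption: the paper's exact kappa variant is not stated in this
    summary; this is the textbook unweighted form.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of cases given the same category by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: computed from each rater's marginal category counts.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical TIRADS categories (1-5) assigned by two models to ten scenarios.
model_a = [1, 2, 3, 4, 5, 3, 4, 2, 5, 1]
model_b = [1, 2, 3, 4, 4, 3, 5, 2, 5, 1]
print(round(cohen_kappa(model_a, model_b), 2))  # 0.75
```

In this made-up example the two models agree on 8 of 10 cases (observed agreement 0.8) while chance agreement is 0.2, which gives κ = 0.75. Because TIRADS categories are ordinal, a weighted κ that penalizes one-category disagreements less than larger ones would also be a reasonable choice.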

Ninety scenarios were created. With ACR-TIRADS, the IOA analysis gave κ = 0.58 between ChatGPT and Gemini, 0.53 between ChatGPT and Claude, and 0.90 between Gemini and Claude. With EU-TIRADS, κ was 0.73 between ChatGPT and Gemini, 0.62 between ChatGPT and Claude, and 0.72 between Gemini and Claude. With K-TIRADS, κ was 0.88 between ChatGPT and Gemini, 0.70 between ChatGPT and Claude, and 0.61 between Gemini and Claude.
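The nine reported κ values are easier to compare side by side. The short sketch below stores them exactly as given in the abstract and picks out the best- and worst-agreeing pair for each TIRADS; the names `reported_kappa` and `extreme_pairs` are illustrative only and do not come from the paper.

```python
# Pairwise kappa values as reported in the abstract, keyed by TIRADS.
reported_kappa = {
    "ACR-TIRADS": {
        ("ChatGPT", "Gemini"): 0.58,
        ("ChatGPT", "Claude"): 0.53,
        ("Gemini", "Claude"): 0.90,
    },
    "EU-TIRADS": {
        ("ChatGPT", "Gemini"): 0.73,
        ("ChatGPT", "Claude"): 0.62,
        ("Gemini", "Claude"): 0.72,
    },
    "K-TIRADS": {
        ("ChatGPT", "Gemini"): 0.88,
        ("ChatGPT", "Claude"): 0.70,
        ("Gemini", "Claude"): 0.61,
    },
}


def extreme_pairs(by_pair):
    """Return the (highest-kappa pair, lowest-kappa pair) for one TIRADS."""
    best = max(by_pair, key=by_pair.get)
    worst = min(by_pair, key=by_pair.get)
    return best, worst


summary = {system: extreme_pairs(pairs) for system, pairs in reported_kappa.items()}
for system, (best, worst) in summary.items():
    pairs = reported_kappa[system]
    print(f"{system}: highest {best} = {pairs[best]}, lowest {worst} = {pairs[worst]}")
```

Run on these figures, the highest-agreement pair is Gemini and Claude under ACR-TIRADS but ChatGPT and Gemini under both EU-TIRADS and K-TIRADS, so no single pair of models agrees best across all three systems.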

This study found non-negligible variability between the three AIs. Clinicians and patients should be aware of these findings.

## Full-text entities

- **Diseases:** TNs (MESH:C562719), RoM (MESH:D009369), TN (MESH:D016606)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12227502/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12227502/full.md

---
Source: https://tomesphere.com/paper/PMC12227502