# LLM evaluation for thyroid nodule assessment: comparing ACR-TIRADS, C-TIRADS, and clinician-AI trust gap

**Authors:** Xi Dai, Yu Xi, Yong Hu, Qingyan Ding, Yu Zhang, Hui Liu, Piaofei Chen, Xi Wang, Wenjun Wang, Chaoxue Zhang

PMC · DOI: 10.3389/fendo.2025.1667809 · 2025-09-29

## TL;DR

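This study compares how well three large language models (GPT-4o, GPT-o3-mini, and DeepSeek-R1) stratify thyroid nodule malignancy risk and produce guideline-aligned management recommendations. GPT-4o was the most accurate (AUC 0.898 under C-TIRADS), while DeepSeek-R1 was the most trusted by clinicians.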

## Contribution

The study evaluates LLMs for thyroid nodule assessment under both the ACR-TIRADS and C-TIRADS frameworks and measures clinician trust in the resulting outputs.

## Key findings

- GPT-4o achieved the highest AUC (0.898) under C-TIRADS, nearing expert-level accuracy.
- DeepSeek-R1 received the highest clinician trust ratings under C-TIRADS (mean Likert: surgeon 4.65, endocrinologist 4.63).
- Clinicians consistently favored C-TIRADS over ACR-TIRADS for all models.

## Abstract

The objective was to evaluate the diagnostic performance and clinical utility of three advanced large language models (LLMs), namely GPT-4o, GPT-o3-mini, and DeepSeek-R1, in stratifying thyroid nodule malignancy risk and generating guideline-aligned management recommendations based on structured narrative ultrasound descriptions.

This diagnostic modeling study evaluated three LLMs—GPT-4o, GPT-o3-mini, and DeepSeek-R1—using standardized narrative ultrasound descriptors. These descriptors were annotated by consensus among three senior board-certified sonologists and processed independently in a stateless manner to ensure unbiased outputs. LLM outputs were assessed under both ACR-TIRADS and C-TIRADS frameworks. Two experienced clinicians (a thyroid surgeon and an endocrinologist) independently rated the outputs across five clinical dimensions using 5-point Likert scales. Primary outcomes included the area under the receiver operating characteristic curve (AUC) for malignancy prediction, and clinician ratings of guideline adherence, patient safety, operational feasibility, clinical applicability, and overall performance.
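As an illustration of the primary accuracy outcome, the sketch below computes an AUC from ordinal risk categories using the rank-based (Mann-Whitney) formulation, in which ties between a malignant and a benign case count as half a win. The function name `auc_from_scores`, the integer encoding of risk categories, and the sample data are all hypothetical; the abstract does not describe how the authors computed their AUC values.

```python
def auc_from_scores(scores, labels):
    """Rank-based AUC: probability that a malignant case receives a
    higher risk score than a benign case, with ties counted as 0.5.

    scores: ordinal risk categories (hypothetical integer encoding)
    labels: 1 for malignant, 0 for benign
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up example data, not taken from the study.
    example_scores = [1, 2, 3, 4, 5, 3]
    example_labels = [0, 0, 0, 1, 1, 1]
    print(round(auc_from_scores(example_scores, example_labels), 3))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a small case series; a sorted-rank implementation would be preferable for large datasets.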

GPT-4o achieved the highest predictive AUC (0.898) under C-TIRADS, approaching expert-level accuracy. DeepSeek-R1, particularly with C-TIRADS, received the highest clinician ratings (mean Likert: surgeon 4.65, endocrinologist 4.63), reflecting greater trust in its practical recommendations. Clinicians consistently favored the C-TIRADS framework across all models. GPT-4o and GPT-o3-mini received lower ratings in trustworthiness and recommendation quality, especially from the endocrinologist.
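A minimal sketch of how a single mean Likert value per rater could be derived from the five rated dimensions. It assumes a simple unweighted average over all cases and dimensions, which the abstract does not specify; the dimension keys, the function name `mean_likert`, and the sample ratings are hypothetical.

```python
from statistics import mean

# The five clinical dimensions named in the abstract (key names are hypothetical).
DIMENSIONS = (
    "guideline_adherence",
    "patient_safety",
    "operational_feasibility",
    "clinical_applicability",
    "overall_performance",
)


def mean_likert(ratings):
    """Average 5-point Likert ratings over all cases and dimensions.

    ratings: list of dicts, one per case, mapping each dimension to 1..5.
    Assumes an unweighted mean; the study's aggregation may differ.
    """
    values = []
    for case in ratings:
        for dim in DIMENSIONS:
            score = case[dim]
            if not 1 <= score <= 5:
                raise ValueError(f"rating out of range for {dim}: {score}")
            values.append(score)
    return mean(values)


if __name__ == "__main__":
    # Made-up ratings from one rater for two cases, not taken from the study.
    sample = [
        {dim: 5 for dim in DIMENSIONS},
        {dim: 4 for dim in DIMENSIONS},
    ]
    print(mean_likert(sample))
```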

While GPT-4o demonstrated superior diagnostic accuracy, clinicians most trusted DeepSeek-R1 combined with the C-TIRADS framework for generating practical, guideline-consistent recommendations. The findings highlight the critical need for alignment between AI-generated outputs and clinician expectations, and the importance of incorporating region-specific clinical guidelines (like C-TIRADS) for the effective real-world implementation of LLMs in thyroid nodule management decision support.

## Full-text entities

- **Diseases:** thyroid nodule (MESH:D016606), malignancy (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12515656/full.md

---
Source: https://tomesphere.com/paper/PMC12515656