LLM evaluation for thyroid nodule assessment: comparing ACR-TIRADS, C-TIRADS, and clinician-AI trust gap

Xi Dai; Yu Xi; Yong Hu; Qingyan Ding; Yu Zhang; Hui Liu; Piaofei Chen; Xi Wang; Wenjun Wang; Chaoxue Zhang

PMC · DOI: 10.3389/fendo.2025.1667809 · September 29, 2025

Open Access

TL;DR

This study compares how well three large language models (GPT-4o, GPT-o3-mini, and DeepSeek-R1) assess thyroid nodules and align with clinical guidelines. GPT-4o achieved the highest AUC (0.898 under C-TIRADS), while DeepSeek-R1 was the model clinicians trusted most.

Contribution

The study evaluates LLMs for thyroid nodule assessment under both the ACR-TIRADS and C-TIRADS frameworks and measures clinician trust in the AI outputs.

Findings

01

GPT-4o achieved the highest AUC (0.898) under C-TIRADS, nearing expert-level accuracy.

02

DeepSeek-R1 received the highest clinician trust ratings (mean Likert 4.65) under C-TIRADS.

03

Clinicians consistently favored C-TIRADS over ACR-TIRADS for all models.
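The two headline metrics above, AUC and mean Likert rating, can be illustrated with a minimal sketch. It assumes that each nodule receives an ordinal risk category and a binary malignancy label, and it computes AUC as the probability that a randomly chosen malignant nodule is ranked above a randomly chosen benign one (ties count as half). The function name `auc_from_scores` and all data values are hypothetical and are not taken from the study; the authors' actual statistical procedure is not described in this summary.

```python
from statistics import mean


def auc_from_scores(scores, labels):
    """Rank-based AUC: P(malignant score > benign score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only (NOT from the study).
tirads_levels = [2, 3, 3, 4, 4, 5, 5, 5]   # ordinal risk category per nodule
malignant = [0, 0, 0, 0, 1, 1, 0, 1]       # 1 = malignant, 0 = benign
likert_ratings = [5, 4, 5, 4, 5]           # 5-point trust ratings for one model

toy_auc = auc_from_scores(tirads_levels, malignant)
toy_mean_likert = mean(likert_ratings)
print(f"toy AUC = {toy_auc:.3f}, toy mean Likert = {toy_mean_likert:.2f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for a small illustration; a rank-sum formulation would be the usual choice for larger datasets.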

Abstract

This study aimed to evaluate the diagnostic performance and clinical utility of three advanced large language models (LLMs), GPT-4o, GPT-o3-mini, and DeepSeek-R1, in stratifying thyroid nodule malignancy risk and generating guideline-aligned management recommendations based on structured narrative ultrasound descriptions. In this diagnostic modeling study, the three LLMs were evaluated using standardized narrative ultrasound descriptors. These descriptors were annotated by consensus among three senior board-certified sonologists and processed independently in a stateless manner to ensure unbiased outputs. LLM outputs were assessed under both the ACR-TIRADS and C-TIRADS frameworks. Two experienced clinicians (a thyroid surgeon and an endocrinologist) independently rated the outputs across five clinical dimensions using 5-point Likert scales. Primary outcomes included the area under the…

Linked entities

Species (1)

Homo sapiens (human · species)

Diseases (2)

thyroid nodule malignancy

Figures (4)

Taxonomy

Topics: Thyroid Cancer Diagnosis and Treatment · Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews