# ChatGPT-4o, Gemini Advanced and DeepSeek R1 in preoperative decision-making for thyroid surgery: a comparative assessment with human surgeons

**Authors:** Long Zou, Peng Zhang, Yu-qi Jiang, Xiao-wen Wang, Xi-jing Yan, Jie-zhong Wu, Jia Qi, Wen-chao Li, Qing-qing Cai, Zhi-rong Xuan, Kun-peng Hu

PMC · DOI: 10.3389/fonc.2025.1590230 · Frontiers in Oncology · 2025-10-24

## TL;DR

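This study compares three LLMs (ChatGPT-4o, Gemini Advanced, and DeepSeek R1) with expert surgeon consensus on preoperative planning for 123 thyroid surgery patients. DeepSeek R1 (56.10%) and ChatGPT-4o (47.97%) matched the surgeons overall far more often than Gemini Advanced (24.39%), but all three need refinement and clinical validation before adoption.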

## Contribution

The study benchmarks the preoperative decisions of three LLMs, covering thyroidectomy extent and lymph node dissection planning, against expert surgeon consensus on clinical data from 123 thyroid surgery patients.

## Key findings

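- ChatGPT-4o and DeepSeek R1 showed substantial agreement with surgeons in preoperative planning, most clearly in lymph node dissection planning (69.11%, κ=0.616 and 79.67%, κ=0.741), although their overall concordance was only 47.97% and 56.10%.
- Gemini Advanced had the lowest overall concordance (24.39%) and the lowest agreement in lymph node dissection planning (34.96%, κ=0.188), yet it matched DeepSeek R1 on thyroidectomy extent (67.48%).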
- Model-specific variability highlights the need for refinement and clinical validation before adoption.
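
Labels such as "moderate" and "substantial" are usually read against a fixed scale for Cohen's κ. The sketch below maps the κ values reported in the abstract onto the Landis and Koch bands. That the paper used this particular scale is an assumption, and `interpret_kappa` is a hypothetical helper written for illustration only.

```python
# Landis and Koch bands for Cohen's kappa.
# Assumption: the summary does not state which scale the authors applied.
BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def interpret_kappa(kappa: float) -> str:
    """Return the Landis and Koch label for a kappa value."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0.0:
        return "poor"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    return "almost perfect"  # unreachable, kept for completeness


# Kappa values as reported in the abstract.
reported = {
    ("thyroidectomy extent", "ChatGPT-4o"): 0.484,
    ("thyroidectomy extent", "Gemini Advanced"): 0.548,
    ("thyroidectomy extent", "DeepSeek R1"): 0.535,
    ("lymph node dissection", "ChatGPT-4o"): 0.616,
    ("lymph node dissection", "DeepSeek R1"): 0.741,
    ("lymph node dissection", "Gemini Advanced"): 0.188,
}

for (task, model), kappa in reported.items():
    print(f"{task:22s} {model:16s} kappa={kappa:.3f} -> {interpret_kappa(kappa)}")
```

Under this scale, all three thyroidectomy-extent values fall in the "moderate" band, which is consistent with the abstract's wording.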

## Abstract

The integration of large language models (LLMs) into surgical decision-making is an emerging field with potential clinical value. This study assessed the preoperative decision-making consistency of ChatGPT-4o, Gemini Advanced, and DeepSeek R1 in comparison with expert consensus, using clinical data from 123 patients undergoing thyroid surgery. Overall concordance rates were 47.97% for ChatGPT-4o, 24.39% for Gemini Advanced, and 56.10% for DeepSeek R1. In thyroidectomy extent decisions, all three models showed moderate consistency with the surgical team, with agreement rates of 61.79% (κ=0.484) for ChatGPT-4o, 67.48% (κ=0.548) for Gemini, and 67.48% (κ=0.535) for DeepSeek R1 (all p < 0.001). However, significant divergence was observed in lymph node dissection planning: ChatGPT-4o achieved a high concordance rate of 69.11% (κ=0.616), DeepSeek R1 showed the highest at 79.67% (κ=0.741), while Gemini’s performance was relatively poor at 34.96% (κ=0.188). Although ChatGPT-4o and DeepSeek R1 exhibit substantial agreement with experienced surgeons in preoperative planning, overall performance still leaves room for improvement. Model-specific variability, particularly in oncologic decision-making, highlights the need for refinement and robust clinical validation before widespread clinical adoption.
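
The abstract reports two statistics per task: a raw agreement rate and Cohen's κ, which corrects that rate for agreement expected by chance. The following is a minimal sketch of how both can be computed from paired decisions. The function names and the ten-case example are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter


def concordance_rate(model, surgeons):
    """Fraction of cases where the model's decision equals the surgeons'."""
    if len(model) != len(surgeons) or not model:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(m == s for m, s in zip(model, surgeons)) / len(model)


def cohens_kappa(model, surgeons):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(model)
    p_o = concordance_rate(model, surgeons)  # observed agreement
    model_counts = Counter(model)
    surgeon_counts = Counter(surgeons)
    categories = set(model_counts) | set(surgeon_counts)
    # Chance agreement from each rater's marginal category frequencies.
    p_e = sum(model_counts[c] * surgeon_counts[c] for c in categories) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical decisions for ten cases (illustrative only).
surgeons = ["lobectomy"] * 5 + ["total"] * 5
model = ["lobectomy"] * 4 + ["total"] + ["total"] * 4 + ["lobectomy"]

print(concordance_rate(model, surgeons))        # 0.8
print(round(cohens_kappa(model, surgeons), 3))  # 0.6
```

In this toy example, chance agreement is 0.5, so an 80% raw agreement shrinks to κ = 0.6. The same effect explains why the reported κ values sit below the corresponding agreement percentages.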

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12591979/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12591979/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12591979/full.md

---
Source: https://tomesphere.com/paper/PMC12591979