# Artificial Intelligence as a Support Tool for Preoperative Patient Education in Anesthesiology: A Comparative Evaluation of Five Large Language Models

**Authors:** Ahmet Tuğrul Şahin, Mehtap Gürler Balta, Vildan Kölükçü, Ali Genç, Serkan Karaman, Tuğba Karaman, Hakan Tapar

PMC · DOI: 10.3390/jcm15062197 · Journal of Clinical Medicine · 2026-03-13

## TL;DR

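This study compares five large language models (ChatGPT, Gemini, Microsoft Copilot, DeepSeek, and Grok) on how well they answer common patient questions about anesthesia. Performance differed significantly between models and clinical contexts, so the authors recommend cautious, context-aware use rather than substitution for anesthesiologists' clinical judgment.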

## Contribution

The study provides a multidimensional evaluation of five large language models for preoperative patient education in anesthesiology, highlighting model-specific and context-dependent performance differences.

## Key findings

- ChatGPT achieved the highest overall performance among the evaluated models.
- Gemini demonstrated superior accuracy in answering anesthesiology-related questions.
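- Model performance varied across anesthesiology subspecialties (general anesthesia, spinal/epidural anesthesia, peripheral nerve blocks), with significant model × topic interactions in multiple domains (p < 0.01).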

## Abstract

**Background/Objectives:** Large language models (LLMs) are increasingly used for patient education, yet comparative evidence regarding their accuracy, safety, and ethical performance remains limited, particularly in high-risk fields such as anesthesiology. This study aimed to conduct a multidimensional comparison of five contemporary LLMs in answering common patient questions in anesthesiology.

**Methods:** In this cross-sectional, comparative in silico study, 30 standardized patient questions covering general anesthesia, spinal/epidural anesthesia, and peripheral nerve blocks were submitted to ChatGPT, Gemini, Microsoft Copilot, DeepSeek, and Grok. Responses were independently evaluated under full blinding by five senior anesthesiology professors using a 5-point Likert scale across six domains: accuracy, safety, completeness, understandability, ethics, and overall assessment. Inter-rater reliability was assessed using intraclass correlation coefficients (ICC). Performance differences were analyzed using linear mixed-effects models accounting for question- and evaluator-level variability, with results reported as estimated marginal means.

**Results:** Inter-rater agreement was good to excellent across all domains (ICC > 0.75). Significant model-related differences were observed for overall assessment, accuracy, safety, completeness, and ethics (all p < 0.001), whereas understandability did not differ significantly between models. ChatGPT achieved the highest overall performance, while Gemini demonstrated superior accuracy. Model performance varied across anesthesiology subspecialties, with significant model × topic interactions identified in multiple domains (p < 0.01).

**Conclusions:** LLMs may serve as supportive tools for patient education in anesthesiology; however, their performance varies substantially across models and clinical contexts. Differences in accuracy, safety, and ethical performance highlight the need for cautious, context-aware integration of LLMs into clinical practice rather than their use as substitutes for anesthesiologists’ clinical judgment.
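
The abstract reports inter-rater reliability as an intraclass correlation coefficient but does not say which ICC form was used. As an illustration only, the sketch below computes one common choice, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table in which each row is a rated response and each column is an evaluator. The function name `icc2_1` and the example ratings are hypothetical and are not data from the study.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows; each row holds one score per rater.
    Assumption: this ICC form is chosen for illustration; the paper's
    abstract does not specify which form the authors used.
    """
    n = len(ratings)        # number of rated items (e.g. model responses)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)              # between-item mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical example: 6 items scored by 4 raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(example), 2))
```

In this example the raters rank the items similarly but differ systematically in how high they score, which lowers an absolute-agreement ICC. A consistency-type ICC would discount those rater offsets, so the choice of form matters when comparing a result against a threshold such as the 0.75 reported in the abstract.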

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13026539/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13026539/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/PMC13026539/full.md

---
Source: https://tomesphere.com/paper/PMC13026539