# Performance of Large Language Models in Metabolic Bariatric Surgery: a Comparative Study

**Authors:** Hassan El-Masry, Mohamed Yasser El-mezayen, Bothaina Farouk, Abdelrahman M. Tawfik, Passant Saeed Sharsher, Basma Hesham Mohamed, Ahmed Abo Elmagd, Ali Khammas, Abdelrahman Nimeri, Ricardo V Cohen, Ahmed Abokhozima

PMC · DOI: 10.1007/s11695-025-08418-y · Obesity Surgery · 2025-12-11

## TL;DR

This study evaluates how well six large language models answer 100 guideline-based questions about metabolic bariatric surgery, finding moderate accuracy (60.0% to 66.0%) with no statistically significant difference between models.

## Contribution

The study provides the first comparative analysis of LLM performance in the specialized field of metabolic bariatric surgery.

## Key findings

- ChatGPT-4o had the highest accuracy at 66.0%, while DeepSeek had the lowest at 60.0%.
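- LLMs performed best in guideline-based domains such as indications/contraindications (78.7%) and complications/management (68.0%), and worst in preoperative preparation (52.0%) and postoperative care (58.4%).
- Binary questions yielded higher accuracy (69.1%) than multiple-choice questions (62.0%).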

## Abstract

The rapid integration of Large Language Models (LLMs) into healthcare necessitates a rigorous evaluation of their performance in specialized medical fields. In metabolic bariatric surgery (MBS), LLMs have the potential to revolutionize education and clinical support, yet their accuracy and reliability are not well-established. This study provides a critical assessment of the capabilities of current LLMs in the context of MBS.

This cross-sectional validation study assessed the performance of six LLMs (ChatGPT-3.5, ChatGPT-4o, Gemini, Copilot, GROK, and DeepSeek) in answering 100 evidence-based binary and multiple-choice questions related to MBS. Questions were constructed from international guidelines and categorized into six thematic domains. Expert consensus answers served as the reference standard, with inter-rater reliability measured using Fleiss’ κ. Model outputs were scored for accuracy. Differences across LLMs were first assessed with the Friedman test, an overall test for differences between multiple related groups. Pairwise comparisons between LLMs were then conducted to identify specific differences in performance.
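To make the inter-rater reliability measure concrete, the following is a minimal sketch of Fleiss' κ in plain Python. It is not the authors' analysis code. The function name `fleiss_kappa`, the input layout (one row per question, one count per answer category), and the toy data are assumptions made for illustration, since the summary does not state the number of experts or answer categories.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned question i to
    category j. Every question must be rated by the same number of raters.
    (Hypothetical helper for illustration; not from the paper.)
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each question needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]

    # Per-question agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (assumed): 3 questions, 3 raters, 2 answer options, full agreement.
toy = [[3, 0], [0, 3], [3, 0]]
print(fleiss_kappa(toy))
```

κ compares the observed share of agreeing rater pairs with the share expected if raters chose categories at random with the observed marginal frequencies, so it is 1 for complete agreement and near 0 for chance-level agreement.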

Across the dataset, a mean of 3.9 of the six LLMs (SD = 1.8) answered each question correctly. ChatGPT-4o achieved the highest accuracy (66.0%), while DeepSeek recorded the lowest (60.0%). Accuracy varied across domains: it was highest for indications/contraindications (78.7%) and complications/management (68.0%), and lowest for preoperative preparation (52.0%) and postoperative care (58.4%). Binary questions yielded higher accuracy (69.1%) than multiple-choice questions (62.0%). Inter-expert reliability was substantial (κ = 0.742, 95% CI: 0.71–0.77). Agreement between LLMs and experts ranged from fair (DeepSeek κ = 0.349) to moderate (ChatGPT-4o κ = 0.446). No significant accuracy differences were detected across models (Friedman test, p = 0.662).
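The agreement labels used above ("fair", "moderate", "substantial") match the widely used Landis and Koch bands for κ. As a rough sketch, agreement between one model and the expert consensus could be computed with a two-rater Cohen's κ as below. The summary does not say which κ variant the authors used for model-versus-expert agreement, so the use of Cohen's κ here, the helper names `cohen_kappa` and `landis_koch`, and the toy answers are all assumptions.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical answers.
    (Hypothetical helper for illustration; not from the paper.)"""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of questions answered identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal answer frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy answers (assumed): expert consensus vs. one model on four questions.
experts = ["A", "A", "B", "B"]
model = ["A", "B", "B", "B"]
print(cohen_kappa(experts, model))   # 0.5

# Values reported in the abstract, mapped to their bands.
for k in (0.349, 0.446, 0.742):
    print(k, landis_koch(k))         # fair, moderate, substantial
```

Because κ discounts agreement expected by chance, a model can reach roughly 60% to 66% raw accuracy and still show only fair-to-moderate κ against the expert reference.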

LLMs represent a promising, yet imperfect, adjunct in MBS education. Their utility is currently limited by inconsistencies in accuracy, particularly in areas requiring nuanced clinical judgment. While these models can supplement traditional learning resources, they are not yet a substitute for expert clinical guidance. This study underscores the need for continued refinement and validation of LLMs to ensure their safe and effective integration into clinical practice.

The online version contains supplementary material available at 10.1007/s11695-025-08418-y.

- LLMs show moderate accuracy in bariatric surgery education, strongest in guideline-based domains.
- Newer models (ChatGPT-4o, Gemini, Copilot) performed slightly better, but gains were modest.
- Accuracy was higher for binary than multiple-choice questions.


## Full-text entities

- **Diseases:** type II diabetes (MESH:D003924), dumping syndrome (MESH:D004377), leaks (MESH:D019559), Obesity (MESH:D009765), autoimmune diseases (MESH:D001327), Obesity and Metabolic Disorders (MESH:D000067329), complication (MESH:D008107), Hyperlipidemia (MESH:D006949)
- **Chemicals:** caffeine (MESH:D002110)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12957009/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12957009/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12957009/full.md

---
Source: https://tomesphere.com/paper/PMC12957009