# Postoperative complication management: How do large language models measure up to human expertise?

**Authors:** Sophie-Caroline Schwarzkopf, Jean-Paul Bereuter, Mark Enrik Geissler, Jürgen Weitz, Marius Distler, Fiona R. Kolbinger, Mathew Kiang

PMC · DOI: 10.1371/journal.pdig.0000933 · PLOS Digital Health · 2025-08-01

## TL;DR

This study compares three large language models (GPT-3, GPT-4, and Gemini-Advanced) with human surgical caregivers on triage, diagnosis, and acute management of postoperative complications across six patient cases.

## Contribution

The study evaluates the performance of three LLMs in managing postoperative complications against human surgical caregivers.

## Key findings

- GPT-4 correctly identified postoperative complications in 96.7% of responses, versus 76.3% for humans and 75.0% for GPT-3; the difference was not statistically significant (p = 0.47).
- GPT-3 and GPT-4 provided comprehensive management plans, while Gemini-Advanced often failed to provide recommendations.
- LLMs demonstrated potential to support and augment surgical routine care.

## Abstract

Managing postoperative complications is an essential part of surgical care and largely depends on the medical team’s experience. Large Language Models (LLMs) have demonstrated immense potential in supporting medical professionals. To evaluate the potential of LLMs in surgical patient care, we compared the performance of three state-of-the-art LLMs in managing postoperative complications to that of a panel of medical professionals based on six postsurgical patient cases. Six realistic postoperative patient cases were queried using GPT-3, GPT-4, and Gemini-Advanced and presented to human surgical caregivers. Humans and LLMs provided a triage assessment, an initial suspected diagnosis, and an acute management plan, including initial diagnostic and therapeutic measures. Responses were compared based on medical contextual correctness, coherence, and completeness. In comparison to human caregivers, GPT-3 and GPT-4 possess considerable competencies in correctly identifying postoperative complications (humans: 76.3% vs. GPT-3: 75.0% vs. GPT-4: 96.7%, p = 0.47) as well as triaging patients accordingly (humans: 84.8% vs. GPT-3: 50% vs. GPT-4: 38.3%, p = 0.19). With regard to diagnostic and therapeutic management of postoperative complications, GPT-3 and GPT-4 provided comprehensive management plans. Gemini-Advanced often provided no diagnostic or therapeutic recommendations and censored its outputs. In summary, LLMs can accurately interpret postoperative care scenarios and provide comprehensive management recommendations. These results showcase the improvements in LLM performance with regard to postoperative surgical use cases and provide evidence for their potential value to support and augment surgical routine care.

## Full-text entities

- **Diseases:** Postoperative complication (MESH:D011183), postoperative (MESH:D019106)
- **Chemicals:** Gemini (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12316209/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12316209/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12316209/full.md

---
Source: https://tomesphere.com/paper/PMC12316209